Post CrowdStrike, we must find calm in the chaos
Matt Saunders
7th August, 2024
7 min read
A code deployment of CrowdStrike Falcon on Friday 19 July 2024 rendered some 8.5 million devices across many industries effectively useless, severely disrupting many lives and making headline news across the planet. Once the cause of the incident was identified and things started to return to normal, it was inevitable that the world would rally around to investigate how things could have been done better.
Without full details of what actually went wrong, these are often speculative 'hot takes' that oversimplify or make drastic assumptions about software architecture, CrowdStrike or Microsoft, or the processes used in developing, testing, and operating complex software. But we can already be sure that there are learnings we can take away based on sound contemporary software delivery principles applicable to any organisation trying to deploy software.
A frequent and justifiable response to the incident asks, 'why was this update not tested?' At this stage, we don't know whether it was, and it's possible that a hot-fix was deployed rapidly, bypassing the usual testing processes. So often, our testing regimes aren't good: perhaps they are very slow, or have tests that repeatedly or always fail. Writing and using unit tests is not enough; the real comfort here comes from solid integration tests, which ensure that newly developed or patched software works well with other components.
Given an urgent requirement, possibly one that is security-related, it's tempting to deploy without testing. Takeaways here include making sure that the test process is simple, rapid, and free of false positives. Doing this ensures that developers can have confidence in the process and know that tests have their backs.
Hear our experts discuss the CrowdStrike Falcon incident
Tune into our DevOps Decrypted podcast: Ep.27—DevRel—humanising DevOps, with insight from Google's Jennifer Davis, who reminds us that, at the core of the CrowdStrike issue, there are people.
Adaptavist works with many organisations to build software to be deployed onto their own servers—often web applications that run in a user's browser. And in these cases it's possible and desirable to have a seamless and rapid deployment process based around the principles of Continuous Delivery. Small, distinct, and frequent deployments are proven to reduce risk, deliver value to customers faster, and make recovery from a bad deployment simpler.
However, software intended to run on other organisations' computers—for example, on reservation terminals at airports or on embedded devices in healthcare—presents new problems. At Adaptavist, we see this with the development of plugins and add-ons such as ScriptRunner in the Atlassian ecosystem. These are designed to run not only in cloud environments but also on our customers' servers with varied hardware specifications for our 'Data Center' products.
The CrowdStrike Falcon software is deployed to a very wide array of different hardware, and testing all these combinations is both costly and time-consuming. Similarly, software intended to be deployed to mobile phones faces a huge matrix of possible combinations. Testing is generally carried out on virtual servers; this is both faster and simpler than using physical devices, but can lead to some incompatibilities being missed. Equally, testing only by hand is not realistic. Finding the right balance between automated and manual testing in this situation is vital.
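To make the scale of that matrix concrete, here is a minimal Python sketch. The dimensions and their values are hypothetical and chosen only for illustration; they are not CrowdStrike's actual test targets. It compares the full set of combinations with a much smaller subset in which every individual value still appears at least once. That subset is one possible starting point when the full matrix is too expensive to run on every change.

```python
from itertools import product

# Hypothetical dimensions, invented purely for illustration.
OS_VERSIONS = ["os-10", "os-11", "os-server-2019", "os-server-2022"]
CPU_ARCHITECTURES = ["x86_64", "arm64"]
AGENT_CONFIGS = ["default", "strict", "legacy"]


def full_matrix(*dimensions):
    """Return every combination of the given dimensions."""
    return list(product(*dimensions))


def one_per_value(*dimensions):
    """Return a small subset in which every value of every dimension
    appears at least once (it does not cover every pairing)."""
    longest = max(len(d) for d in dimensions)
    return [tuple(d[i % len(d)] for d in dimensions) for i in range(longest)]


everything = full_matrix(OS_VERSIONS, CPU_ARCHITECTURES, AGENT_CONFIGS)
smoke_set = one_per_value(OS_VERSIONS, CPU_ARCHITECTURES, AGENT_CONFIGS)
print(len(everything), "combinations in full;", len(smoke_set), "in the reduced set")
```

With these three small dimensions the full matrix already has 24 entries against 4 in the reduced set, and each extra dimension multiplies the full count. The reduced set trades coverage for speed, so it complements rather than replaces periodic runs of the wider matrix.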
So, once the software is built and thoroughly tested, we move to the deployment phase. We always advise a progressive or 'blue/green' rollout of software, but for software that runs on other people's computers this often isn't possible. Carrier-grade network routers often ship with the ability to deploy new software to a resilient control plane; the idea is that if the update doesn't work, the device can change back to working software to prevent downtime. We see this echoed in software architecture, with load balancers configurable to avoid sending traffic to broken instances or pods. We mirror this with techniques such as progressive rollouts, which contemporary orchestrators such as Kubernetes and modern serverless platforms make seamless. The types of devices that CrowdStrike Falcon runs on do not support this level of resilience, which means that scrutiny of testing is absolutely essential when you have limited control of the end-user device.
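The idea of a progressive rollout can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any vendor or orchestrator implements it: the function name, the stage percentages, and the `deploy` and `is_healthy` callbacks are all hypothetical stand-ins for whatever deployment and health-check mechanisms an organisation actually has.

```python
def progressive_rollout(devices, deploy, is_healthy,
                        stage_percents=(1, 10, 50, 100),
                        max_failure_rate=0.0):
    """Deploy to a growing share of devices, halting at the first bad stage.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the
    caller. Returns (devices_updated, completed).
    """
    updated = []
    for percent in stage_percents:
        # Cumulative number of devices that should be updated after this stage.
        target = max(1, len(devices) * percent // 100)
        batch = devices[len(updated):target]
        if not batch:
            continue
        for device in batch:
            deploy(device)
        updated.extend(batch)
        failures = sum(1 for device in batch if not is_healthy(device))
        if failures / len(batch) > max_failure_rate:
            # Stop here: the remaining devices never receive the bad version.
            return updated, False
    return updated, True


fleet = [f"device-{i}" for i in range(200)]
touched, completed = progressive_rollout(
    fleet,
    deploy=lambda device: None,        # stand-in for a real deployment step
    is_healthy=lambda device: False,   # simulate an update that breaks every device
)
print(len(touched), "devices affected; rollout completed:", completed)
```

In this simulated bad release, only the first 1% stage (2 of 200 devices) is affected before the rollout halts. The sketch assumes a health signal can be collected from each device after the update, which is exactly what is hard when a device crashes outright.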
There are further learnings to come when the full details of how the software interacts with the operating system kernel come to light, but we can see some valid takeaways already. Ensuring loose coupling of components, documented API contracts between components, and use of circuit-breakers can mitigate the impact of a bad deployment. The interaction between CrowdStrike Falcon and Windows caused computers to crash entirely, meaning that just redeploying a known good version of Falcon wasn't possible. This is a common experience with operating systems whose security models allow direct access to the kernel, and we don't yet fully understand why this was the case. But our takeaway is to run software with the minimum possible privileges, to minimise the damage a bad version can cause.
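As an illustration of the circuit-breaker pattern mentioned above, here is a minimal, generic Python sketch. The class and parameter names are hypothetical, and it protects calls between ordinary application components. It would not, on its own, prevent a kernel-level crash of the kind seen in this incident.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to the dependency, for example `breaker.call(fetch_status, host)`, where `fetch_status` is whatever function talks to the other component. It would treat `CircuitOpenError` as a signal to fall back to a safe default rather than keep hammering a broken component.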
Having a well-thought-out incident response process is also vital. The full details of why the CrowdStrike Falcon update was pushed to all devices at 6 am are unknown, but we must accept that sometimes that absolutely has to happen. And if the rollout fails, it's critical that customers can receive updates, remedial actions, and effective communication from the software vendor.
Adaptavist is in the business of helping organisations across the globe work better, through software, services, and techniques that make a difference in how these organisations operate and how software can be delivered quickly, reliably, and safely. There are still many questions over the CrowdStrike incident, and we don't have all the answers now. Much commentary, including our own, is speculative, and it's likely that the full extent of the issues that caused such a dramatic outcome can't be completely mitigated. But we do know sound, solid principles of software engineering, testing, deployment, and incident response, and we can help.
Written by
Matt Saunders
DevOps Lead
From a background as a Linux sysadmin, Matt is an authority in all things DevOps. At Adaptavist and beyond, he champions DevOps ways of working, helping teams maximise people, process and technology to deliver software efficiently and safely.