As everyone likely knows by now, CrowdStrike, a Texas-based cybersecurity company whose endpoint-security software runs on millions of Microsoft Windows machines, triggered a massive worldwide outage with an update it pushed out last night. The impact was broad, affecting a number of industries:
- Health care organizations (Harris Health System in Houston had to suspend visits)
- Airlines (flight delays and cancellations across airports and major airlines)
- Access to bank accounts (Australia and New Zealand)
- Stock exchange non-trading services (London)
- Electronic cash register malfunctions impacting stores and restaurants (McDonald's)
- Law enforcement agency systems and emergency response systems (like 911)
- Courts (Maryland’s court system shut down)
Even some television broadcasts and billboards in Times Square went dark!
This was caused by a “simple” real-time update pushed to millions of machines to address a hacking threat. A bug in the update triggered a conflict with the underlying Windows operating system, preventing the OS from loading and causing the “blue screen of death” on some machines. For machines that are not blue-screening, CrowdStrike rolled out an automatic fix that backs out the bad update. For systems stuck on the blue screen, IT professionals must repair each machine manually, a time-consuming remedy.
From an insurance industry point of view, there is no way at this time to predict worldwide potential business interruption claims, but the impact is expected to be enormous. These losses will be passed on to reinsurers.
From a technology point of view, this episode highlights the vulnerabilities in modern interconnected software implementations. As Insurance Journal put it, “accelerated by the COVID-19 pandemic, governments and businesses alike have become increasingly dependent on a handful of interconnected technology companies over the past two decades, which explains why one software issue rippled far and wide.”
Insurers are using a small number of cloud providers like Amazon, Microsoft, and Google to host their applications, data warehouses, and infrastructure. Cloud-native services from these providers supply critical components of modern applications. Even in legacy environments, mainframes and cloud have become intertwined as hybrid cloud becomes more common. In fact, even the latest AI technologies depend on a small number of vendors like OpenAI, Microsoft, or Anthropic (setting aside the open-source toolkits for small language models). Consequently, a bad patch applied to these environments could take down hundreds of thousands of applications and affect millions of users.
Patches are also arriving ever more frequently and being applied in real time, as the CrowdStrike patch was. Weekly (or even more frequent) patches from major software and firmware suppliers are not uncommon.
The bottom line is change management. Change management has been critical to IT availability and reliability for as long as IT has been around; what has changed is that technology worldwide now relies on a small number of foundational ecosystems. Patches, software upgrades, and other changes that are deployed rapidly must still be effectively tested. A/B testing is also crucial for monitoring the results of a release. Every change needs a flawless backout process that works when bugs or errors emerge, not just a backout process for the “happy path.” Full rollout should happen only after success criteria are met, and even then in waves or phases, in case something was missed.
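To make that concrete, here is a minimal sketch of what a staged rollout gate with an unhappy-path backout might look like. The wave names, error-rate threshold, soak period, and the deploy/error-rate/rollback callables are all assumptions for illustration; they are not CrowdStrike's actual release tooling or any specific vendor's API.

```python
import time
from typing import Callable, Iterable, List

def staged_rollout(
    patch_id: str,
    waves: Iterable[str],                     # e.g. ["canary_1pct", "early_10pct", "full_100pct"]
    deploy: Callable[[str, str], None],       # deploy(patch_id, wave) on the management platform
    error_rate: Callable[[str, str], float],  # fraction of endpoints in a wave reporting faults
    rollback: Callable[[str, str], None],     # rollback(patch_id, wave)
    max_error_rate: float = 0.001,            # success criterion: fewer than 0.1% faulting endpoints
    soak_seconds: int = 30 * 60,              # observe each wave before expanding
) -> bool:
    """Deploy a patch in waves; back out every wave touched if any wave fails."""
    completed: List[str] = []
    for wave in waves:
        deploy(patch_id, wave)
        completed.append(wave)
        time.sleep(soak_seconds)  # soak period: let endpoint telemetry accumulate
        if error_rate(patch_id, wave) > max_error_rate:
            # Backout must cover the unhappy path: unwind every wave deployed so far.
            for done in reversed(completed):
                rollback(patch_id, done)
            return False
    return True
```

The key design point is that the rollout never expands to the next wave unless the success criterion holds, and a failure anywhere unwinds everything already deployed rather than only the most recent step.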
Beyond change management specifically, insurance carriers need a risk diversification strategy for their technology. This, again, is fundamental. It brings me back to an experience I had in the 1990s at AIG, when I had to crawl into a sewer to direct city workers addressing a flooded sewer pipe that also happened to carry all of the network connectivity for one of AIG's core NYC buildings. This happened because the building lacked local exchange carrier (LEC) diversity: the “backup” connection terminated in the street at the same place as the primary connection, even though the two lines exited the building at different points. No risk assessment of this configuration had been done.
Carriers today are in a similar situation: they must conduct risk analysis across their entire enterprise, especially when everything runs through the same cloud, uses the same software (like CrowdStrike) across platforms, or relies on the same model, algorithm, or LLM.
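One simple place to start that analysis is an inventory of which critical systems share the same provider. The sketch below is a hypothetical example under assumed data; the application names, provider labels, and the 75% flag threshold are illustrative, not a prescribed methodology.

```python
from collections import defaultdict

# Hypothetical inventory: each critical system lists the shared providers it
# depends on (cloud platform, endpoint security, LLM service, etc.).
INVENTORY = {
    "policy_admin":  ["azure", "crowdstrike_falcon"],
    "claims_portal": ["aws", "crowdstrike_falcon", "openai_api"],
    "billing":       ["azure", "crowdstrike_falcon"],
    "fraud_scoring": ["aws", "openai_api"],
}

def concentration_report(inventory: dict[str, list[str]]) -> dict[str, float]:
    """Return, for each provider, the share of critical systems that depend on it."""
    counts: dict[str, int] = defaultdict(int)
    for deps in inventory.values():
        for provider in set(deps):
            counts[provider] += 1
    total = len(inventory)
    return {p: n / total for p, n in sorted(counts.items(), key=lambda kv: -kv[1])}

if __name__ == "__main__":
    for provider, share in concentration_report(INVENTORY).items():
        flag = "  <-- concentration risk" if share >= 0.75 else ""
        print(f"{provider:20s} {share:.0%}{flag}")
```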
Finally, all carriers need a comprehensive business continuity and disaster recovery plan in place to address worst-case scenarios. We have learned this before, from September 11, 2001 through the COVID-19 pandemic of 2020-2022. Business operations should be able to adapt and adjust if a core technology goes down, whether that is the cloud environment, security software, shared infrastructure, the network, or the internet itself. The disaster recovery process should define recovery time objectives (RTOs) and recovery point objectives (RPOs) for each business function and each technology component. In this case, each affected machine should be restorable from a stored image no more than one week old (the RPO) within one hour (the RTO). That restore should be scripted and automated, triggered by logs indicating that machines are not coming back up on the network.
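A minimal sketch of that trigger-and-restore automation follows. The check-in data source, the image-restore callable, and the timeout values are all assumptions; the right mechanics depend on each carrier's own monitoring and imaging tooling.

```python
import datetime as dt
from typing import Callable, Dict, List

CHECKIN_TIMEOUT = dt.timedelta(minutes=15)  # host considered down if silent this long
RPO = dt.timedelta(weeks=1)                 # restore from an image no older than one week
RTO = dt.timedelta(hours=1)                 # target time to bring the host back online

def restore_missing_machines(
    last_checkin: Dict[str, dt.datetime],           # hostname -> last network check-in time
    image_timestamp: Callable[[str], dt.datetime],  # newest stored image for a host
    restore_image: Callable[[str], None],           # kick off an automated image restore
    now: dt.datetime,
) -> List[str]:
    """Detect hosts that stopped checking in and trigger automated image restores."""
    restored: List[str] = []
    for host, seen in last_checkin.items():
        if now - seen < CHECKIN_TIMEOUT:
            continue  # host is still reporting; nothing to do
        if now - image_timestamp(host) > RPO:
            # The stored image is too stale to meet the RPO; escalate for manual handling.
            print(f"ALERT: {host} has no image within the RPO; manual intervention needed")
            continue
        restore_image(host)  # the restore itself should complete within the RTO
        restored.append(host)
    return restored
```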
The only thing we know is that this will happen again, either on purpose or accidentally. The only thing we don’t know is how the next worldwide outage will be triggered.
If you’d like to discuss your organization’s risk profile or recovery plans further, please don’t hesitate to reach out.