A bug in a testing tool improperly validated a content update to CrowdStrike’s Falcon software, causing “problematic content” in a file to be sent out to millions of Windows devices and kicking off a global crash of computer systems whose ramifications are still being felt.
In a preliminary Post Incident Review (PIR) published Wednesday, the cybersecurity company said its Content Validator tool – which is part of a multi-step process for delivering content configuration updates to the Falcon Sensor – evaluated two Template Instances delivering Rapid Response Content. One of the instances contained the problem content but passed validation because of the bug, and based on the results of checks earlier in the process, “these instances were deployed into production,” CrowdStrike wrote.
The result was that when the sensor received the content in Channel File 291 and loaded it into the Content Interpreter, it led to an out-of-bounds memory read that triggered an unexpected exception that “could not be gracefully handled, resulting in a Windows operating system crash” and the Blue Screen of Death (BSOD).
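To make the failure mode concrete, the following is a minimal sketch of how an interpreter can refuse to read past the end of the data it was given instead of raising an unhandled exception. It is not CrowdStrike's code: the PIR passage quoted here does not describe the internal layout of Channel File 291, so the field list, the indexes and the function names `read_field` and `interpret` are all assumptions made for illustration. Python raises an `IndexError` on such a read, whereas native code in a kernel driver may read adjacent memory and crash the system.

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when the index falls outside the data.

    Hypothetical helper: the explicit range check is the point of the sketch.
    """
    if 0 <= index < len(fields):
        return fields[index]
    return None


def interpret(fields: Sequence[str], wanted_indexes: Sequence[int]) -> bool:
    """Return True only if every requested field exists.

    False means "reject this content and keep running" rather than crash.
    """
    for index in wanted_indexes:
        if read_field(fields, index) is None:
            return False
    return True
```

With two fields supplied, asking for indexes 0 and 1 succeeds, while asking for index 2 is rejected cleanly instead of reading beyond the list.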
Microsoft said 8.5 million Windows systems worldwide were affected by the outage, which hit industries such as travel, health care and financial services and led to grounded planes, postponed surgeries and delayed banking operations. Some businesses are still recovering, with Delta Air Lines continuing to cancel flights five days later.
The outage also hammered CrowdStrike’s share price and gave cybercriminals an opening to run scams playing off the crash.
In addition, there now is Congressional scrutiny, with the heads of the House of Representatives Committee on Homeland Security and Subcommittee on Cybersecurity and Infrastructure Protection sending a letter this week giving CrowdStrike CEO George Kurtz until Wednesday to schedule a hearing with the subcommittee.
“While we appreciate CrowdStrike’s response and coordination with stakeholders, we cannot ignore the magnitude of this incident, which some have claimed is the largest IT outage in history,” wrote Mark Green (R-TN), Homeland Security Committee chairman, and Andrew Garbarino (R-NY), subcommittee chairman. “Recognizing that Americans will undoubtedly feel the lasting, real-world consequences of this incident, they deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking.”
Problem Tied to Rapid Response Update
CrowdStrike’s PIR gives a detailed breakdown of what went wrong on July 19, noting that Windows hosts running version 7.11 of CrowdStrike’s sensor that were online during a 90-minute stretch that morning received the problem update. Systems that came online after or were not connected during the period were not affected. In addition, Mac and Linux systems were not impacted.
CrowdStrike delivers security content configuration updates to its sensors in two ways. One is Sensor Content, which is shipped directly with the sensor. Such content isn’t dynamically updated from the cloud and includes Template Types, which have pre-defined fields written in code that threat detection engineers can use in Rapid Response Content. Template Types “go through an extensive QA process, which includes automated testing, manual testing, validation and rollout steps,” the vendor wrote.
Rapid Response Content is used to quickly respond to changes in the threat landscape and can be dynamically updated outside of the Falcon sensor. The content is delivered as Template Instances that map to specific behaviors for the sensor to see, detect, or prevent and have a set of fields that match the desired behavior. There are three key systems used to test and deploy Rapid Response Content, including the Content Configuration System, which creates Template Instances and includes the Content Validator.
The other two are the Content Interpreter, which reads the Rapid Response Content on the sensor, and the Sensor Detection Engine, which uses that content to detect and prevent malicious activity.
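The relationship between a Template Type, a Template Instance and a validation step can be sketched roughly as follows. This is an illustration under stated assumptions, not CrowdStrike's design: the class names mirror the terms in the PIR, but the field names, the `validate` function and the specific check (supplied fields must match the fields the type defines) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TemplateType:
    """Ships with the sensor and defines which fields an instance may set."""
    name: str
    field_names: Tuple[str, ...]


@dataclass
class TemplateInstance:
    """Delivered as Rapid Response Content; fills in a Template Type's fields."""
    template_type: TemplateType
    values: Dict[str, str]


def validate(instance: TemplateInstance) -> List[str]:
    """Return a list of problems; an empty list means the instance passes."""
    expected = set(instance.template_type.field_names)
    supplied = set(instance.values)
    problems = [f"missing field: {name}" for name in sorted(expected - supplied)]
    problems += [f"unknown field: {name}" for name in sorted(supplied - expected)]
    return problems
```

In this sketch, an instance that omits a field the type defines is reported with a "missing field" message before it can be published, which is the kind of gate a content validator is meant to provide.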
Trust in Tests of Previous Updates
CrowdStrike wrote that, in February, it introduced a new InterProcessCommunication (IPC) Template Type to detect novel attack techniques. The next month, the IPC Template Type was run through a stress test in a staging environment that includes a variety of operating systems and workloads. After passing the stress test and being validated for use, the IPC Template Instance and subsequently three others were deployed in April and performed as expected in production.
Two additional IPC Template Instances were deployed on July 19 – including the one with the problematic content – based on the tests run before the initial deployment of the Template Type in March, the checks performed in the Content Validator, and the fact that the previous IPC Template Instances were deployed without triggering problems in Windows systems, CrowdStrike wrote.
Steps Being Taken
In response, the vendor said it is improving the Rapid Response Content testing by using additional testing types – including local developer testing and content update and rollback testing – adding more validation checks to Content Validator for Rapid Response Content, and improving the existing error handling in Content Interpreter.
There also are new checks in the deployment of Rapid Response Content, including gradually deploying updates, improving monitoring and giving users greater control over when and where the updates are deployed.
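One common way to deploy an update gradually is to assign each host to a stable bucket and widen the share of buckets that receive the update over time. The sketch below shows that general technique only; the PIR does not say how CrowdStrike will implement its staged deployment, and the function names, the hashing scheme and the percentage threshold are assumptions.

```python
import hashlib


def rollout_bucket(host_id: str, update_id: str) -> int:
    """Map a host to a stable bucket from 0 to 99 for a given update."""
    digest = hashlib.sha256(f"{update_id}:{host_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive(host_id: str, update_id: str, percent_enabled: int) -> bool:
    """Return True if the host falls inside the currently enabled percentage."""
    return rollout_bucket(host_id, update_id) < percent_enabled
```

Because the bucket is derived from a hash, a host that receives the update at 10 percent still receives it at 50 percent, and setting the percentage back to 0 stops further delivery while a problem is investigated.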