The CrowdStrike Incident
Overview
In July 2024, a faulty software update from CrowdStrike, a leading cybersecurity company, caused widespread system crashes on millions of Windows computers globally. The incident disrupted critical sectors such as transportation, healthcare, and finance, highlighting the vulnerabilities inherent in modern IT systems. This article explains the technical background that ultimately led to the incident.
What is CrowdStrike?
CrowdStrike is a publicly traded US company that provides information security and cybersecurity solutions. The company was founded in 2011 and has assisted in investigating various cases of economic cyber espionage, including on behalf of the United States Department of Justice.[1] Google is one of its main investors, and the company now has a market capitalization of about 65 billion US dollars.[2] In 2013, the software product CrowdStrike Falcon was launched. CrowdStrike Falcon is a widely used Endpoint Detection and Response (EDR) solution for protection against cyberattacks.[3] The product uses AI and machine learning technologies, among others, to protect computer systems.
The Incident
On July 19, 2024, CrowdStrike released an update - Channel File 291 - for its Falcon Sensor software. This update introduced an additional input parameter (21 instead of the expected 20), which caused an out-of-bounds memory access. The error led to system crashes, recognizable by the “Blue Screen of Death” (BSOD), on affected Windows systems.
Technical Background
CrowdStrike Falcon
Like any software product, CrowdStrike Falcon requires maintenance. CrowdStrike calls its central security scanner agent the “CrowdStrike Falcon Sensor”. It needs constant updates to respond appropriately to current threat situations and security vulnerabilities.[3] During operation, updates are regularly rolled out via so-called channel files, which CrowdStrike uses to distribute dynamic updates and detection rules.[3]
Chronology
On July 19, 2024, numerous system failures were detected worldwide. It was quickly determined that the affected systems were using CrowdStrike Falcon as protection software. However, this incident was not a classic cyberattack. Rather, the error occurred during a routine standard procedure - the software update process.
In February 2024, a new feature was rolled out for the “Falcon Sensor”, intended to improve the detection of threats and attacks that abuse Windows mechanisms. In March and April 2024, Rapid Response Updates (the designation for special Falcon updates) for this feature were tested as part of internal stress tests; no abnormalities were detected.
On July 19, 2024, the Rapid Response Content updates were then deployed to certain Windows hosts. Specifically, this was Channel File 291. During the root cause analysis, it was determined that the sensor expected 20 input fields, but the update provided 21. This led to a memory access outside the bounds of the allocated data, which ultimately resulted in a system crash - the Blue Screen of Death (BSOD).[4]
Software Architecture of Falcon
We now want to discuss the technical background of the Rapid Response Updates. To do so, it helps to first understand how the Falcon Sensor software works. The CrowdStrike Falcon Sensor is an advanced security system that uses artificial intelligence and machine learning to detect and defend against malicious activities on computers. It continuously collects data from protected devices and analyzes it to identify suspicious behavior, detecting both known and new attack patterns. To stay up to date, the sensor’s protection mechanisms are regularly updated and improved through insights from the analysis of real threats.[4]
An important component of the sensor is the so-called “Rapid Response Content”, which is deployed in the form of channel file updates. These are flexible rules that allow the system to respond quickly to new threats without requiring manual adjustment of the system. The rules are regularly updated and enable the sensor to adapt to constantly changing threat landscapes.[4]
The new function rolled out in February 2024 - the “IPC Template Type” - is intended to enable the analysis of certain Windows interprocess communication (“IPC”) mechanisms. IPC Template instances were deployed as Rapid Response Content via Channel File 291 as intended. In that channel file - now known to be the faulty update - 21 input parameters were defined, whereas the sensor software expected 20. The mismatch was not detected by the test routines because a wildcard was used as the 21st input there.[4]
What happened specifically on July 19? Two additional IPC Template instances were rolled out, one of which had a non-wildcard criterion as its 21st input parameter. In previous versions, the 21st input parameter had never been used and was not intended. Sensors that received the new version now compared the input parameters against the rule: the interpreter expected 20 fields but received 21, and the attempt to access the 21st value caused an out-of-bounds memory read beyond the data field, resulting in a system crash.[4]
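To illustrate this failure mode, the following C sketch shows how a rule that defines 21 match criteria can drive an interpreter that only receives 20 input fields into an out-of-bounds read - and why a wildcard in the 21st position hides the defect during testing. This is not CrowdStrike’s actual code: all structure and field names, the fixed field count of 20, and the wildcard convention are assumptions made purely for illustration.

```c
/*
 * Hypothetical illustration of the Channel File 291 failure mode.
 * Not CrowdStrike code: all names, the field count of 20, and the
 * wildcard convention are assumptions for demonstration purposes.
 */
#include <stdio.h>
#include <string.h>

#define SENSOR_INPUT_FIELDS 20          /* fields the sensor actually supplies */

/* A detection rule as it might arrive in a channel file. */
typedef struct {
    int         criteria_count;         /* 21 in the faulty content */
    const char *criteria[21];           /* "*" acts as a wildcard (match anything) */
} detection_rule;

/* Returns 1 if the observed inputs match the rule.
 * Bug: the loop trusts criteria_count instead of SENSOR_INPUT_FIELDS,
 * so a 21st non-wildcard criterion forces a read past the end of inputs. */
static int rule_matches(const detection_rule *rule,
                        const char *inputs[SENSOR_INPUT_FIELDS])
{
    for (int i = 0; i < rule->criteria_count; i++) {
        if (strcmp(rule->criteria[i], "*") == 0)
            continue;                   /* wildcard: nothing is read, test passes */
        /* For i == 20 this reads inputs[20], one element past the 20-field
         * array: an out-of-bounds access. In a kernel-mode driver such an
         * access can bring down the whole system. */
        if (strcmp(inputs[i], rule->criteria[i]) != 0)
            return 0;
    }
    return 1;
}

int main(void)
{
    const char *inputs[SENSOR_INPUT_FIELDS];
    for (int i = 0; i < SENSOR_INPUT_FIELDS; i++)
        inputs[i] = "observed-value";

    /* Test-time content: the 21st criterion is a wildcard, so it is never
     * compared against anything and the defect stays invisible. */
    detection_rule tested = { .criteria_count = 21 };
    for (int i = 0; i < 21; i++)
        tested.criteria[i] = "*";
    printf("wildcard rule matched: %d\n", rule_matches(&tested, inputs));

    /* July 19 content: the 21st criterion is concrete, so inputs[20] is read
     * (undefined behavior). */
    detection_rule faulty = tested;
    faulty.criteria[20] = "specific-ipc-target";
    printf("faulty rule matched: %d\n", rule_matches(&faulty, inputs));
    return 0;
}
```

Clamping the loop to SENSOR_INPUT_FIELDS, or rejecting content whose criteria count exceeds the number of fields the sensor actually supplies, would prevent the out-of-bounds access regardless of what the content defines.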
In this context, it’s important to understand why an out-of-bounds error led to a complete system crash. The reason lies in the system architecture of the Falcon agent. It does not run as a separately isolated process; instead, it accesses memory resources directly through the Windows kernel as a kernel-mode component. From a technical perspective, Falcon uses a device driver that operates in kernel mode - also known as “Ring Zero”. This gives Falcon the same full access to operating system resources that Microsoft’s own Windows components have. Falcon also has an integrated quarantine function, which inspects suspicious and potentially problematic files in an isolated area and makes them inaccessible to the operating system.[5]
In this case, however, this mechanism led to a race condition: the faulty file was loaded at startup because it is a component of the software itself. As a result, the agent essentially overtook itself, leading to a Blue Screen of Death (BSOD).[6]
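To make the kernel-mode aspect tangible, here is a small user-mode analogy in C (again not Falcon code; the scenario and file name seh_demo.c are assumptions for illustration). A normal process can contain an access violation with Windows structured exception handling and keep running, whereas a kernel-mode driver has no such safety net: an equivalent fault triggers a bug check, which Windows presents as the Blue Screen of Death.

```c
/* User-mode analogy only - not Falcon code. A user-mode process can contain
 * an access violation with structured exception handling (SEH); a fault in a
 * kernel-mode ("Ring Zero") driver cannot be contained this way and triggers
 * a bug check, i.e. the Blue Screen of Death.
 * Build with MSVC: cl seh_demo.c
 */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    volatile int *p = NULL;

    __try {
        *p = 42;    /* deliberate access violation, analogous to the OOB access */
    }
    __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH) {
        puts("Access violation caught - only this process was at risk.");
    }

    puts("In kernel mode there is no equivalent recovery: the whole OS stops.");
    return 0;
}
```

This asymmetry is why a single faulty channel file could take down the entire operating system rather than just the Falcon process.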
Political Background
Why was such a malfunction even possible? Direct kernel access is normally not intended for third-party software. The fact that this access was possible goes back to an agreement between Microsoft and the European Commission from 2009, which obliged Microsoft to grant third-party security software access to the operating system kernel. The background to this decision was Microsoft’s market power and the aim of ensuring that other companies could compete.[5][7]
Critics argued that Microsoft took the easy way out here. Apple’s macOS operating system was not affected because Apple implemented an API instead of a direct access method; through this API, third-party providers can use kernel functionality. Microsoft could now use this incident politically to argue against such decisions. On the other hand, critics see a need for action from Microsoft, which could also improve security in its own system through better implementations.[5][7]
The incident clearly illustrates the tension between ensuring fair competition and the need to guarantee the security and stability of critical systems.
Impact
The error led to widespread global system failures. Windows systems running the Falcon Sensor that received the faulty update were affected; on these, the update resulted in a Blue Screen of Death (BSOD). An estimated 8.5 million systems were impacted, which represents less than 1% of Windows systems worldwide.
According to CrowdStrike, systems that installed the channel file “C-00000291*.sys” with a timestamp of 04:09 UTC on July 19, 2024 were affected. If the file had a timestamp of 05:27 UTC or later, it was a corrected version that was not affected. Linux and macOS systems were not impacted either. Microsoft, however, reported that the first outages could be observed as early as 19:00 UTC on July 18, 2024.[3]
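As a small illustration of this criterion, the following C sketch (Windows only, and a hypothetical example rather than an official tool) enumerates files matching C-00000291*.sys and reports whether their last-write timestamp lies before or after the 05:27 UTC cutoff on July 19, 2024. The directory path C:\Windows\System32\drivers\CrowdStrike is an assumption about where the sensor stores its channel files; CrowdStrike’s official guidance should be consulted for actual remediation.

```c
/* Hedged sketch: checks whether installed C-00000291*.sys channel files carry
 * the corrected timestamp (05:27 UTC on July 19, 2024, or later). The
 * directory path is an assumption, not an official reference.
 * Build with MSVC: cl check_channel_file.c
 */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* 05:27 UTC on July 19, 2024 - files at or after this time are the fix. */
    SYSTEMTIME cutoff_st = {0};
    cutoff_st.wYear   = 2024;
    cutoff_st.wMonth  = 7;
    cutoff_st.wDay    = 19;
    cutoff_st.wHour   = 5;
    cutoff_st.wMinute = 27;

    FILETIME cutoff_ft;
    if (!SystemTimeToFileTime(&cutoff_st, &cutoff_ft))
        return 1;

    const char *pattern =
        "C:\\Windows\\System32\\drivers\\CrowdStrike\\C-00000291*.sys";

    WIN32_FIND_DATAA ffd;
    HANDLE h = FindFirstFileA(pattern, &ffd);
    if (h == INVALID_HANDLE_VALUE) {
        puts("No matching channel file found.");
        return 0;
    }

    do {
        /* ftLastWriteTime is stored in UTC, so it can be compared directly. */
        int corrected = CompareFileTime(&ffd.ftLastWriteTime, &cutoff_ft) >= 0;
        printf("%s: %s\n", ffd.cFileName,
               corrected ? "corrected version (>= 05:27 UTC)"
                         : "potentially faulty version (< 05:27 UTC)");
    } while (FindNextFileA(h, &ffd));

    FindClose(h);
    return 0;
}
```

In practice, affected machines could not run such a check themselves, because they crashed before the operating system was fully available - which is why the manual recovery procedure described below was necessary.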
What were the concrete impacts? Worldwide (North America, Europe, parts of Asia), there were outages at airports, and flights could not take place as planned. Health-critical systems such as hospital information systems were also affected; as a result, medically necessary operations could no longer be performed as planned, and in some cases hospitals continued to operate in emergency mode. Outages of payment systems, cash register systems, and ATMs were also observed. The TV channel Sky News had to temporarily cease broadcasting.[8]
Initial Actions and Closing the Vulnerability
As an initial measure, the affected systems had to be started manually in safe mode. Then, the faulty file in the CrowdStrike system directory had to be deleted via the command line. Since the affected systems were no longer bootable on their own, these interventions had to be carried out as lengthy manual work - machine by machine. Remote desktop access was no longer possible either. If BitLocker was active, the corresponding recovery key also had to be known; otherwise the system could not be booted for repair at all.[9]
The corrected channel file update was rolled out just over an hour later (update timestamp 05:27 UTC or later). Systems that were offline when the faulty channel file was rolled out, and therefore received the corrected version right away, were naturally not affected by the system failure.
Similar Vulnerabilities in the Past
To better understand the significance of the CrowdStrike incident, it’s helpful to compare it with other major cybersecurity events in recent history:
- WannaCry Ransomware Attack (2017): The WannaCry ransomware attack in May 2017 was one of the most devastating cyber incidents in recent years. It exploited a vulnerability in Windows operating systems and spread rapidly across the globe, encrypting files and demanding ransom payments in Bitcoin. The attack affected over 200,000 computers in 150 countries, severely impacting critical infrastructure such as the UK’s National Health Service (NHS). WannaCry highlighted the crucial need for timely software updates and robust cybersecurity practices to mitigate ransomware risks.[10]
- SolarWinds Breach (2020): Discovered in December 2020, the SolarWinds breach was a sophisticated cyber espionage campaign attributed to a state-sponsored group. Attackers compromised the Orion software platform, used by thousands of organizations worldwide, by injecting malicious code into routine software updates. This allowed them to access the networks of numerous high-profile targets, including U.S. government agencies and Fortune 500 companies. The SolarWinds incident underscored vulnerabilities in software supply chains and the need for enhanced security measures against sophisticated attacks.[10]
While the CrowdStrike incident was not a cyberattack but rather a software malfunction, its impact was comparable to these major security breaches. All three incidents highlight the critical importance of rigorous testing, timely updates, and robust patch management practices. They also demonstrate how vulnerabilities in widely-used software can have far-reaching consequences across multiple sectors and organizations.[10]
The Importance of Updates in General
Updates are crucial for IT security as they fix vulnerabilities in software that could be exploited by attackers. Security gaps arising from missing updates allow hackers to infiltrate systems, steal data, and cause damage. Updates also improve the stability and performance of software, fix bugs, and ensure systems remain protected against newer threats. Without regular updates, systems remain vulnerable to attack techniques targeting unprotected weaknesses.
Despite their importance, missing or delayed updates are one of the most common causes of security vulnerabilities. In many cases, patches are not installed in a timely manner either due to convenience or lack of resources. This affects both individuals and companies, where inefficient patch management or the absence of a clear update procedure can lead to critical security updates being overlooked or applied with delay.
Another issue is the use of outdated software that is no longer supported with current updates, as well as the incorrect configuration of update mechanisms. Therefore, the regular and correct application of updates remains one of the most important measures for avoiding security risks.
Practical Part and Experiment
We now want to demonstrate how the effects described above ultimately led to the incident.
You can find the full experiment here: https://wiki.elvis.science/index.php?title=The_Crowdstrike_Incident_-_Practical_Part
Summary
The CrowdStrike incident in July 2024 was a widespread IT outage caused by a faulty software update from the cybersecurity provider CrowdStrike. This update led to millions of Windows systems worldwide crashing and becoming unbootable. The cause was an error in the program logic that resulted in faulty memory access. The consequences were severe: planes couldn’t take off, payment systems failed, hospitals faced significant disruptions, and numerous businesses had to temporarily cease operations.
Analysis of this incident highlights the importance of thorough software quality assurance and the necessity of contingency plans for such events. Companies must ensure that their IT systems are resilient against such disruptions and that they have effective procedures for error correction and recovery. The CrowdStrike incident also underscores the far-reaching impacts that a single software error can have. In an increasingly digitized world, IT systems have become a critical component of our infrastructure. Failures of such systems can cause substantial economic damage and affect civil society as a whole. It is therefore crucial that companies and organizations continuously monitor their IT systems. This allows malfunctions to be detected early and appropriate countermeasures (e.g., switching to other systems) to be initiated in time to prevent major damage.
Conclusions
Although the problem was quickly resolved, the incident highlighted deficiencies in software development and quality assurance that can occur even in large companies. One contributing factor is the trend of bringing software to market as quickly as possible, with thorough testing sometimes neglected in the process. Software engineers should focus on the following measures to reduce the likelihood of similar incidents in the future:
- More comprehensive testing: Software must be tested more thoroughly before release to detect and fix errors early.
- Contingency plans: Companies need robust plans for IT failures that should be regularly practiced.
- Patch management solutions: These tools help efficiently manage software updates and minimize risks.
- Virtual desktops and containers: These technologies can mitigate the effects of software errors by creating an isolated environment for applications.[11]
References
[2] https://www.finanzen.net/fundamentalanalyse/crowdstrike
[4] https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
[6] https://www.theregister.com/2024/07/23/crowdstrike_failure_shows_need_for/
[10] https://www.ijfmr.com/papers/2024/4/25310.pdf
[12] https://app.pluralsight.com/ilx/video-courses/clips/13aeb538-8039-44b5-b246-226901becc68