CrowdStrike's Falcon: When Safety Fails in Security
  • Author: David Schmid
  • Date: 21.07.2024
  • Last Update: 21.07.2024, 22:13
  • Categories: Cybersecurity

On July 19, 2024, CrowdStrike experienced an outage considered one of the largest IT failures in history. The outage was caused by "a defect found in a Falcon content update for Windows hosts." While CrowdStrike assures this wasn't a cyberattack, the incident underscores a crucial, often overlooked aspect of digital protection: safety.

Understanding the impact of the CrowdStrike outage requires distinguishing between "safety" and "security." While security prevents malicious breaches, safety ensures systems run reliably without unexpected failures. Both are critical, and neglecting either can compromise confidentiality, integrity and availability.

In enterprise environments, security measures are paramount to fend off cyber threats, but safety should be equally prioritized. Software designed to mitigate security risks can sometimes introduce new safety risks. For users, whether an outage stems from a cyberattack or a system failure, the result is the same: inaccessible services.

Initial forensic analyses indicate a null-pointer error caused system crashes and reboot loops. This highlights the importance of thorough testing for system-critical software and suggests CrowdStrike should improve their quality assurance processes. Adopting memory-safe programming languages like Rust could help prevent similar issues in the future.

Despite assurances that the outage wasn’t a cyberattack, the distinction between safety and security is not comforting in the case of a null-pointer error. Such memory safety lapses can lead to both system failures and security breaches, blurring the lines between technical faults and security issues.

Safety vs. Security: What's the Difference?

Security focuses on protecting systems, networks and data from malicious attacks. It encompasses measures that guard against unauthorized access, breaches and theft. Think of it as the digital equivalent of having locks on your doors and a security alarm system to deter burglars.

Safety, on the other hand, pertains to the reliable functioning and resilience of systems. It involves ensuring that systems can withstand and recover from failures, maintaining their availability and integrity. In essence, safety is like ensuring your home has fire alarms and sturdy structures to protect against internal mishaps.

To illustrate further, consider the aviation industry, where both safety and security measures are essential to providing a safe and secure flight. Security protocols include passenger screenings, passport checks and anti-terrorism measures to protect against threats. Safety measures involve aircraft maintenance, emergency exit designs and seatbelt usage, ensuring that even in unforeseen accidents, passengers are protected.

Note: The distinction between safety and security may not always be clear in other languages. For instance, in German, the word "Sicherheit" encompasses both concepts, which can lead to confusion. It's essential to recognize this nuance when discussing these topics in different linguistic contexts. This is especially true in international contexts where precise terminology impacts digital protection policies. Misunderstandings can result in inadequate measures being implemented, affecting both system reliability and protection against threats.

CrowdStrike's Falcon: A Closer Look

The global technology outage on July 19, 2024, grounded flights, disrupted health services, crashed payment systems and blocked access to Microsoft services in one of the largest IT failures in history.

The disruptions originated from CrowdStrike, a cybersecurity firm providing security software to a wide range of industries. CrowdStrike’s primary products are designed to block hackers and malware. The Falcon Sensor continuously monitors endpoints to detect vulnerabilities in real time, identifying weaknesses and potential entry points that attackers could exploit. Falcon acts as a digital gatekeeper for systems, much like airport security for passengers. Just as every passenger must go through a security scan to board a plane, Falcon inspects all online traffic, ensuring that only safe and authorized data passes through. However, if Falcon fails, nothing goes through, creating a bottleneck.

An update to its Falcon Sensor software malfunctioned, causing significant issues on Windows computers. Affected machines were forced into a boot loop, rendering them unusable. The downtime had a widespread global impact.

This incident underscores the critical balance between improved security and the potential risks associated with reliance on such systems. Implementing Falcon provides robust threat protection but also means that any failure can lead to significant operational disruptions.

CrowdStrike Headquarters (Credit: iStock, Sundry Photography)

The Technical Cause: A Null-Pointer Error

Initial forensic analyses of the Blue Screen of Death (BSOD) memory dumps, as outlined by Zach Vorhies on X, suggest that the issue stems from a null-pointer error in CrowdStrike's CSAgent.sys driver. A null pointer is a special value used to indicate that a pointer or reference does not refer to a valid object. Because a null pointer does not point to a meaningful object, attempting to dereference it (i.e., to access the data stored at the address it holds) usually causes a run-time error or an immediate program crash.

Therefore, it is crucial to check for null pointers before accessing objects or their properties. The initial analyses suggest that this check was missing in the affected driver code. Because the driver runs with kernel-level privileges, the error crashed the entire system rather than a single process. The affected computers were stuck in a reboot loop, requiring manual repair through safe mode.
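The kind of check described here can be sketched in Rust. This is a minimal illustration, not CrowdStrike's code: `ConfigEntry` and `read_threshold` are hypothetical names, and the raw pointer stands in for a pointer that a driver might receive from elsewhere.

```rust
/// Hypothetical record that is handed over by pointer (an assumption for illustration).
#[derive(Debug)]
pub struct ConfigEntry {
    pub threshold: u32,
}

/// Reads `threshold` through a raw pointer, checking for null first.
///
/// Returns `None` instead of dereferencing when the pointer is null.
///
/// # Safety
/// If `entry` is non-null, it must point to a valid, live `ConfigEntry`.
pub unsafe fn read_threshold(entry: *const ConfigEntry) -> Option<u32> {
    // `as_ref` performs the null check: it yields `None` for a null pointer
    // and `Some(&ConfigEntry)` otherwise. The `?` returns early on `None`.
    let entry_ref = unsafe { entry.as_ref() }?;
    Some(entry_ref.threshold)
}
```

Calling `read_threshold(std::ptr::null())` yields `None` rather than a crash, so the caller can decide how to proceed. The null check alone does not prove that a non-null pointer is valid, which is why the function stays `unsafe`.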

Null-pointer dereference errors are particularly problematic in C++ because the language allows direct memory access, making it the developer's responsibility to ensure that pointers are valid before use. Without proper checks, these errors can lead to severe system instability and crashes, as seen in this incident. This highlights the importance of rigorous testing and validation, especially for system-critical software like security drivers.

Blue Screen of Death (BSOD)

The Critical Role of Memory Safety

Memory safety ensures a program only accesses intended memory locations, preventing unpredictable behavior, crashes and exploitable vulnerabilities. This is especially critical in languages like C and C++, where direct memory access requires careful management.
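As a small illustration of what "only accesses intended memory locations" means in practice, the following Rust sketch reads from a buffer with a bounds check. The function name `byte_at` is a hypothetical helper chosen for this example.

```rust
/// Returns the byte at `index`, or `None` if `index` lies outside `buffer`.
pub fn byte_at(buffer: &[u8], index: usize) -> Option<u8> {
    // `get` compares the index against the slice length instead of reading
    // whatever happens to lie beyond the end of the buffer.
    buffer.get(index).copied()
}
```

An out-of-range index produces `None`, which the caller has to handle, instead of a read from unrelated memory. Plain indexing (`buffer[index]`) is also bounds-checked in Rust, but it panics on an invalid index rather than returning a value the caller can inspect.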

Memory safety is vital for both system reliability (safety) and security. Failures in maintaining memory safety, as seen in the CrowdStrike incident, compromise system stability and can be exploited by attackers. The US National Cybersecurity Strategy, highlighted in "Memory Safety: A Key to Robust Cybersecurity Strategies?", states that up to 70% of security flaws in traditional languages are due to memory safety issues. Promoting memory-safe languages like Rust and Go is key to proactively preventing these vulnerabilities.

The CrowdStrike incident’s root cause — a memory safety violation — demonstrates the link between safety and security. Despite assurances that the outage wasn’t a cyberattack, this distinction is not entirely comforting. Memory safety lapses, such as the null-pointer error in the CSAgent.sys driver, can lead to both system failures and security breaches, blurring the lines between technical faults and security issues.

In conclusion, memory safety is crucial for both system reliability and security. The CrowdStrike incident underscores that neglecting memory safety can have severe consequences, reinforcing the need for memory-safe programming practices to ensure robust and secure systems.

Lessons Learned

It's easy to blame the programmer who made the specific error, but bugs are an inevitable part of software development. Every programmer produces bugs and it’s normal for complex systems to encounter issues under certain conditions. What’s crucial is implementing measures to catch and address these bugs before they reach production.

This incident underscores the importance of thorough testing, especially for system-critical software like security drivers. CrowdStrike will need to review and improve their quality assurance processes to prevent similar incidents in the future. Moreover, deploying an update for such critical software to millions of computers simultaneously, rather than using a staggered rollout, raises questions about their deployment strategy.
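A staggered rollout can be sketched as follows. This is a simplified model under stated assumptions, not a description of any vendor's deployment pipeline: hosts are plain IDs, `update_host` is a hypothetical callback that reports whether a host stayed healthy after the update, and `RolloutResult` and `staged_rollout` are names invented for this example.

```rust
/// Outcome of a staged rollout (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum RolloutResult {
    /// Every wave stayed within the failure budget.
    Completed { hosts_updated: usize },
    /// A wave exceeded the failure budget; later waves never ran.
    Halted { wave: usize, hosts_updated: usize },
}

/// Rolls an update out wave by wave and stops at the first wave whose
/// failure rate exceeds `max_failure_rate`.
///
/// `wave_sizes` lists how many hosts each wave covers. `update_host`
/// applies the update to one host and returns `true` if it stayed healthy.
pub fn staged_rollout<F>(
    hosts: &[u32],
    wave_sizes: &[usize],
    max_failure_rate: f64,
    mut update_host: F,
) -> RolloutResult
where
    F: FnMut(u32) -> bool,
{
    let mut next = 0; // index of the first host that has not been updated yet
    for (wave, &size) in wave_sizes.iter().enumerate() {
        let end = (next + size).min(hosts.len());
        let batch = &hosts[next..end];
        if batch.is_empty() {
            break; // no hosts left, or an empty wave: nothing more to do
        }
        // Update every host in this wave and count the ones that failed.
        let failures = batch.iter().filter(|&&host| !update_host(host)).count();
        next = end;
        let failure_rate = failures as f64 / batch.len() as f64;
        if failure_rate > max_failure_rate {
            return RolloutResult::Halted { wave, hosts_updated: next };
        }
    }
    RolloutResult::Completed { hosts_updated: next }
}
```

With ten hosts and waves of 1, 3 and 6, a faulty update that breaks every host is stopped after the single canary host in the first wave, leaving the other nine untouched. A real pipeline would also need health signals, waiting periods between waves and a rollback path, all of which are omitted here.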

The choice of programming language can also play a significant role. C++ allows direct memory access, making it the programmer's responsibility to ensure pointers are valid before use. As mentioned above, adopting memory-safe programming languages like Rust and Go might be a strategic move to improve software safety and security.
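To illustrate how a memory-safe language shifts this responsibility from the programmer to the compiler, here is a minimal Rust sketch. The rule table and the name `severity_or_default` are hypothetical and exist only for this example.

```rust
use std::collections::HashMap;

/// Looks up a rule by name and returns its severity, or `fallback` if the
/// rule does not exist.
pub fn severity_or_default(rules: &HashMap<String, u8>, name: &str, fallback: u8) -> u8 {
    // `HashMap::get` returns `Option<&u8>`, not a pointer that might be null.
    // The value cannot be used until the `None` case has been handled.
    match rules.get(name) {
        Some(&severity) => severity,
        None => fallback,
    }
}
```

Leaving out the `None` arm is a compile-time error, so the missing-value case cannot be forgotten silently. This removes unchecked access, not every failure: calling `unwrap()` on a `None` value still aborts the operation with a panic, so testing and careful rollout remain necessary.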

In conclusion, while individual programming errors are unavoidable, robust testing, careful deployment strategies and the use of memory-safe languages can significantly mitigate the risks of such errors leading to major system failures and security vulnerabilities. This incident should serve as a wake-up call for the industry to prioritize these practices to improve both safety and security.

Conclusion

The CrowdStrike outage on July 19, 2024, highlights the often-overlooked aspect of safety within digital protection. While the issue wasn't a cyberattack, the null-pointer error caused widespread system crashes and reboot loops, emphasizing the importance of thorough testing and quality assurance for system-critical software. This incident underscores the need for businesses to balance security measures with safety protocols to ensure reliable operations and consider adopting memory-safe programming languages to mitigate similar risks in the future. Memory safety can significantly reduce vulnerabilities, making systems more resilient to both accidental failures and malicious attacks.
