In almost every IT manager's worst nightmare, their Friday begins with their phone pinging earlier than their alarm, with alerts and notifications telling them that all of their systems are offline and they're in for a pretty terrible end to their week. Well, today, Friday 19th July 2024, that nightmare became a reality.
In a dramatic unfolding of events, a global IT crisis has struck countless businesses and critical infrastructure systems, causing severe disruptions. This crisis, triggered by a faulty software update from cybersecurity company CrowdStrike, has led to widespread computer crashes worldwide.
What Caused The Outage?
The issue originated from an update to CrowdStrike’s Falcon Sensor, a key component of their Endpoint Detection and Response (EDR) platform. Normally, this system monitors computers to detect and counteract cyber threats. However, the recent update introduced a serious flaw that caused many Windows computers to crash and display the infamous Blue Screen of Death (BSOD), leaving them unable to boot normally.
Who is Affected By The CrowdStrike Outage?
The impact of this malfunction is massive, affecting various sectors across the globe. Notably, Sky News reported a shutdown in broadcast operations, while airports and railway systems in the US and the UK experienced significant delays and interruptions, with some planes told they would be unable to land and others told they could not take off. The ripple effect of the outage also reached numerous companies, causing shutdowns, productivity losses, and a return to cash payments. All of this highlights the interconnected nature of modern digital infrastructure and the critical role cybersecurity software plays in its stability.
How To Fix The CrowdStrike Outage
Microsoft Windows machines bore the brunt of CrowdStrike's shoddy code deployment, with Blue Screens of Death appearing in every direction on Friday morning. As the problem escalated, CrowdStrike moved to reverse the damage by rolling back the disruptive update.
However, systems that had already installed the faulty update faced persistent issues. In an effort to mitigate the damage, CrowdStrike issued a workaround that involves booting Windows in Safe Mode, navigating to the CrowdStrike drivers directory, and manually deleting the problematic file identified by the filename pattern “C-00000291*.sys”.
While initial reports focused on a dodgy update, a user named Brody, who is director of CrowdStrike Overwatch, posted on X (formerly Twitter) that it is “a faulty channel file, so not quite an update.”
There is a potential manual fix, he outlined:
- Boot Windows into Safe Mode or the Windows Recovery Environment (WRE).
- Go to C:\Windows\System32\drivers\CrowdStrike
- Locate and delete the file matching "C-00000291*.sys"
- Boot normally.
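The file-deletion step above can be sketched in Python. This is only an illustration of the matching logic, not an official CrowdStrike tool: the function name `remove_faulty_channel_files` and its `dry_run` flag are hypothetical, and the sketch assumes a Python interpreter and sufficient privileges are available on the affected machine, which may well not be the case in Safe Mode or a recovery environment. The directory and filename pattern are the ones given in the steps above.

```python
from pathlib import Path

# Directory and filename pattern as given in the workaround steps.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path = DRIVER_DIR,
                                dry_run: bool = True) -> list[Path]:
    """Return files in driver_dir matching PATTERN; delete them unless dry_run.

    Hypothetical helper for illustration only.
    """
    if not driver_dir.is_dir():
        # Nothing to do if the directory does not exist on this machine.
        return []
    matches = sorted(p for p in driver_dir.glob(PATTERN) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Default to a dry run so nothing is deleted by accident.
    for path in remove_faulty_channel_files(dry_run=True):
        print(f"would delete: {path}")
```

Defaulting to a dry run means the matching files are listed first; passing `dry_run=False` would perform the deletion.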
What To Do Next
This situation has posed a massive challenge for IT departments globally, especially those managing large networks of computers. The workaround, while effective, requires manual application to each affected system, a process that is not only time-consuming but also impractical for organisations with extensive IT infrastructure.
The difficulty lies in scale rather than complexity. The workaround is straightforward on a single machine, but it was not designed to be rolled out across many systems at once. In large organisations, applying the fix system-by-system could lead to prolonged downtime.
The problem becomes even more complex for systems caught in continuous reboot loops. Each of these machines needs hands-on intervention, so system administrators will need considerable time to apply the fix, as remote updates to resolve the issue aren’t feasible with CrowdStrike’s current capabilities.
While it may be possible for some systems to revert to a previous stable state, this option isn’t available on every system. Applying the fix to thousands of servers and workstations is a daunting task and could significantly disrupt daily operations in many offices.
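To get a rough sense of why a per-machine fix is so costly at scale, the arithmetic can be sketched as below. Every figure here is a hypothetical assumption chosen for illustration (5,000 hosts, 15 minutes per host, 10 technicians); none comes from reports on the outage, and the function `estimated_hours` is likewise made up for this example.

```python
import math


def estimated_hours(hosts: int, minutes_per_host: float, technicians: int) -> float:
    """Elapsed hours if technicians work in parallel, each fixing one host at a time.

    Hypothetical back-of-the-envelope model; ignores travel, breaks and failures.
    """
    if hosts < 0 or minutes_per_host < 0 or technicians < 1:
        raise ValueError("hosts and minutes must be >= 0, technicians >= 1")
    # The busiest technician handles ceil(hosts / technicians) machines.
    hosts_per_technician = math.ceil(hosts / technicians)
    return hosts_per_technician * minutes_per_host / 60


if __name__ == "__main__":
    # Hypothetical inputs: 5,000 hosts, 15 minutes each, 10 technicians.
    print(estimated_hours(5000, 15, 10))  # 125.0
```

Under these assumed numbers, the work takes 125 hours of elapsed time, which is more than three weeks of ordinary working days.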
Who Isn't Affected By The CrowdStrike Outage
Amidst the widespread disruptions caused by the CrowdStrike update, some digital platforms have remained notably resilient. Services such as Google and Google Workspace have not been affected by this particular lapse. This stability can be attributed to Google’s robust infrastructure and its independent security protocols, which do not rely on CrowdStrike’s systems.
Google’s ability to maintain uninterrupted service during such a significant global incident highlights the strength of its cybersecurity measures and system architecture. Google Workspace, known for its suite of productivity tools including Gmail, Docs, and Drive, has continued to function seamlessly, providing a reliable anchor for businesses that might be experiencing disruptions in other areas of their IT infrastructure.
Mac and Linux operating systems also remain untouched by the incident. CrowdStrike CEO George Kurtz was quick to point out that the issue wasn't caused by a security incident or cyber attack, and that it has been isolated and fixed on their end.
CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed. We…
— George Kurtz (@George_Kurtz) July 19, 2024
What Next for CrowdStrike and the IT Community?
Today's events have already sparked a broader discussion in the tech community about the responsibilities of cybersecurity providers and the potential vulnerabilities introduced by the very tools that are supposed to protect digital assets. Experts like Toby Murray and Ian Thornton-Trump have emphasised the severity of the issue and the urgent need for comprehensive solutions that can be applied swiftly and at scale to prevent future incidents of this magnitude.
Cybersecurity firms, including CrowdStrike, are now under scrutiny to enhance their update protocols and ensure that any changes to their systems undergo rigorous testing before deployment. This incident serves as a crucial reminder of the delicate balance between maintaining robust security measures and ensuring the stability of global IT systems.
As the situation continues to develop, the tech community remains vigilant, and companies affected by the outage are working tirelessly to restore full functionality to their systems. The lessons learned from this incident will likely influence future cybersecurity strategies and emergency response protocols, ensuring better preparedness for similar challenges ahead.