It is around 6 am GMT on Friday, and the world comes to a grinding halt as an unprecedented IT outage disrupts airlines, stops trains in their tracks, and delays critical healthcare services.
So, the real question is: why did this happen?
The blame is being laid at the door of CrowdStrike, a cybersecurity firm that offers cloud-based security solutions to tech giants like Amazon Web Services (AWS), Microsoft, and some of the world’s leading banks and airlines.
The issue arose when a seemingly routine update to its antivirus software proved defective. The result was widespread chaos, as major companies and services worldwide were brought to a blue-screened halt.
CrowdStrike was quick to react, publishing a reassuring blog post hours later in which George Kurtz, CrowdStrike Founder and CEO, stated, "The outage was caused by a defect found in a Falcon content update for Windows... This was not a cyber attack". He went on to apologise and to say that the company understood the gravity and impact of the situation.
Microsoft Vice President David Weston estimates that “CrowdStrike’s update affected 8.5 million Windows devices, or less than one per cent of all Windows machines.” However, he recognises that while the percentage of devices affected is small, the broad economic and societal impact is huge.
Arguably the worst-affected sector was the airline industry. CrowdStrike’s defective software update had significant repercussions for major carriers such as American Airlines, British Airways, and Virgin Australia, as passengers arriving at airports across the globe could not board their flights.
The entire airport experience has become much more seamless in recent years as technological advancements have led to online check-ins, digital boarding passes, and even digital queuing processes in some cases. However, the global IT outage resulted in airline staff having to go back to basics as they resorted to manual check-ins, which are tried and tested.
While this may have provided a solution for some who could board their flights successfully, many travellers were burdened by long queues, flight delays, and cancellations. Sky News reported that just over 5,000 flights were grounded on July 19th, which equates to nearly 5% of all scheduled flights globally.
The mayhem that ensued from CrowdStrike’s faulty software update inconvenienced more than just the travel industry. Healthcare professionals also had a torrid time as they experienced difficulties with communication systems, patient records, and administrative processes. Like the airline workers, some hospitals had to go back to basics, opting for manual record-keeping and communication methods in response to the crisis.
Patient care had to be put on hold as appointment scheduling and access to critical medical records were lost due to the outage, resulting in significant delays and operational challenges for hospitals and general practices (GPs). Emergency services were also impacted, with 911 services temporarily down in parts of the USA.
Time will tell how significant the global IT outage was for healthcare systems worldwide, as surgeries, critical procedures, and routine appointments were delayed or cancelled.
The railways were also affected. Commuters and travellers had their plans turned upside down as ticket machines failed. UK railway operators such as South Western Railway and Gatwick Express were forced to warn customers that they were experiencing widespread IT issues, and advised them to purchase their tickets online instead of buying them on-site.
At this stage, a couple of weeks after the event, we cannot pinpoint the costs and repercussions incurred due to this internationally felt IT failure, as the true fallout may take months to calculate. However, we can estimate that the following implications would have been felt by the companies involved:
The global IT outage, triggered by a flawed software update from cybersecurity firm CrowdStrike, sent shockwaves across industries worldwide. Organisations from airlines to healthcare providers grappled with disruption. In many cases, they had to resort to manual, back-to-basics processes to keep operating through the outage.
Here are the key takeaways and lessons we should take from this extraordinarily disruptive event:
Expert Opinions and Final Thoughts
Michael Hamer (Virtual IT Director): The Netitude Take
CrowdStrike’s update testing was flawed. Their preliminary write-up identifies the need for all updates to pass through internal testing, ‘canary’ testing, and then a wider rolling deployment. This is already standard software practice, but there’s pressure to get protections out as soon as possible in security software.
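For readers who want to see the idea in concrete terms, here is a minimal Python sketch of a staged rollout gate. It is purely illustrative and is not how CrowdStrike or any other vendor actually deploys updates: the stage names, fleet fractions, failure thresholds, and the `rollout` function are all hypothetical, and the observed failure rates stand in for real telemetry.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str                # e.g. "internal", "canary", "wide"
    fraction: float          # share of the fleet that receives the update
    max_failure_rate: float  # abort threshold for this stage


# Hypothetical stages and thresholds, chosen only for illustration.
STAGES = [
    Stage("internal", 0.001, 0.0),
    Stage("canary", 0.01, 0.001),
    Stage("wide", 1.0, 0.001),
]


def rollout(observed_failure_rates: dict[str, float]) -> list[str]:
    """Return the stages that completed before the rollout stopped.

    observed_failure_rates maps a stage name to the failure rate measured
    after deploying to that stage (a stand-in for real telemetry).
    """
    completed = []
    for stage in STAGES:
        rate = observed_failure_rates[stage.name]
        if rate > stage.max_failure_rate:
            # Halt here: later, larger stages never receive the update.
            break
        completed.append(stage.name)
    return completed


# A defect that only shows up at the canary stage stops the rollout there.
print(rollout({"internal": 0.0, "canary": 0.5, "wide": 0.0}))  # ['internal']
```

The point of the design is that a bad update is caught while it is on a small slice of machines, so the blast radius stays limited. The trade-off, as noted above, is that each gate delays protection reaching everyone else.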
Lessons will be learned across the industry, and ultimately, many IT systems will be made more reliable. However, it’s not a solved problem. Software is complicated; mistakes can and will happen again.
Despite CrowdStrike cancelling many IT engineers’ weekends, people are coming forward to defend the company’s track record. CrowdStrike has saved them from many potential ransomware attacks that would have been just as disruptive at a company level.
The key takeaway is that society and businesses receive huge benefits from using technology: the time and money saved far outweigh even this ‘biggest ever’ outage.
For management teams: work out your key business functions, identify dependencies (technology and otherwise), plan for outages, work out alternatives that let you keep working in some capacity, and train staff. You never know when you will need to fall back on pen and paper for a day.
How Netitude’s Clients Were Affected
While our clients were not directly affected, key software providers or partners they work with may have been. We work with market leader SentinelOne (CrowdStrike is their closest competitor). They take a different approach to updates without compromising security, allowing for deeper testing before updates go out. “Bad updates” is a scenario we consider and prepare for in our internal Incident Response planning. Planning ahead is the best way of minimising the potential for disruption.
Final Thoughts
One lesson can be learned from this global affair: investing in your IT is critical. It’s a stark reminder of the severe repercussions that widespread technological failures can have.
Partnering with experts – such as globally recognised Managed IT Service Providers like Netitude – can provide proactive risk management to ensure potential faults and failures are located before they cause outages and downtime.
If you’re worried about cybersecurity processes or have doubts about your general IT infrastructure, contact our friendly team of experts today. They’re happy to alleviate any IT-related concerns or queries you might have.