Most of us woke up to news of a global IT outage affecting systems running the Microsoft Windows platform. Unfortunately, many consumer-facing businesses run on this platform, so the disruption to businesses and operations was widespread and global.
Crisis is normal in any endeavour. Those of us who grew up within the IT industry are no strangers to outages, and over time we have learned how to deal with them. This is why we design for continuity and recovery. We have layers of identification, authentication, and authorisation. We design Business Continuity and Disaster Recovery systems. We agree recovery point objectives and recovery time objectives according to business requirements. I am sure many Operations Centres globally triggered these systems and procedures, relying on them to restore operations. Above all, Change Management allows us to plan, test and manage the rollout of software in stages (a sketch of the idea follows below). At present, it is not clear what went wrong with this process.
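To make the staged-rollout idea concrete, here is a minimal sketch of ring-based change management. The ring names, host lists and the deploy_to callback are hypothetical illustrations of the general technique, not any vendor's actual process.

```python
from typing import Callable, Iterable

# Hypothetical rollout rings: a small canary group first, then a wider pilot,
# then the general population. Names and hosts are purely illustrative.
ROLLOUT_RINGS = [
    ("canary", ["host-001", "host-002"]),
    ("early-adopters", ["host-010", "host-011"]),
    ("broad", ["host-100", "host-101", "host-102"]),
]

def staged_rollout(deploy_to: Callable[[str], bool],
                   rings: Iterable[tuple[str, list[str]]]) -> None:
    """Deploy ring by ring, halting the rollout if any ring reports failures."""
    for ring_name, hosts in rings:
        failures = [host for host in hosts if not deploy_to(host)]
        if failures:
            print(f"Halting rollout: {ring_name} ring failed on {failures}")
            return
        print(f"{ring_name} ring healthy; proceeding to next stage")
```

The point of the sketch is simply that a fault caught in the canary ring never reaches the broad population, which is the assurance Change Management is meant to provide.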
CrowdStrike, a managed security service provider, pushed out an update to its software running on Microsoft Windows that contained a bug. It only affected companies using CrowdStrike software to protect their systems. To fix it, affected computers had to be booted into Safe Mode and the offending file removed. The disruption is regrettable, but we will learn from it.
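For illustration only, the manual workaround reported publicly amounted to something like the sketch below, run after booting into Safe Mode. The driver directory and the channel-file pattern are assumptions based on public reports at the time, not an official remediation script.

```python
from pathlib import Path

# Assumed location of CrowdStrike channel files on a Windows host (per public reports).
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
# Assumed filename pattern of the offending channel file (per public reports).
BAD_FILE_PATTERN = "C-00000291*.sys"

def remove_offending_files(dry_run: bool = True) -> None:
    """List (and optionally delete) channel files matching the faulty pattern."""
    if not DRIVER_DIR.exists():
        print(f"{DRIVER_DIR} not found - this host may not run the affected sensor.")
        return
    for candidate in DRIVER_DIR.glob(BAD_FILE_PATTERN):
        if dry_run:
            print(f"Would remove: {candidate}")
        else:
            candidate.unlink()
            print(f"Removed: {candidate}")

if __name__ == "__main__":
    # Keep dry_run=True to inspect first; only delete after booting into Safe Mode.
    remove_offending_files(dry_run=True)
```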
CrowdStrike belongs to the new generation of tools that provide an extra layer of security on top of your operating system. They have evolved from our traditional anti-virus software (I wonder how many folks remember Dr Solomon!) into sophisticated tools that protect you against all kinds of threat vectors. Today, they wear many hats: antivirus, antimalware, antispyware, endpoint integrity, patch management, and so on. In this case, there was a bug in the update distributed by CrowdStrike.
Once in a while, the horse bolts, and folks have to bring it back home. Times like this allow us to detect the flaws in our designs and the weaknesses in our processes. The bigger picture is how reliant we have become on technology: any outage can carry unacceptable impact and, potentially, fatalities.
A previously accepted risk in automated systems management has rudely moved up our risk matrix. I have no doubt that many firms will review their systems management strategy to determine whether another layer of assurance is needed.
Our confidence in automated systems cannot falter, as they have helped us manage the incredible complexity that advanced technology has introduced into our daily lives, both at work and at home. It took us a long time to accept that large, complex IT projects tend to fail. However, complex IT systems built up incrementally do work.
Some writers have highlighted what they term the risk of too many corporates running the Microsoft platform. We have been grappling with this fear since 1998. Many dinosaurs have emerged, and many have disappeared. It is a moving goalpost… The problem did not come from Microsoft. It came from third-party software used to manage Microsoft systems, amongst others. The key issue here is whether the strategy of transferring a risk to a third party is enough.
There is too much reliance on standards today. Are you ISO xxxx compliant? Tick, tick… Corporates need to strengthen their third-party risk assurance processes on a continuous basis, moving from a compliance-based approach to a risk-led approach.
No company is an island, so we will continue to have interdependent systems. We are learning the hard way that it is not enough to transfer a risk to a third party: your assurance processes must dynamically assess your evolving system configuration and deal with it appropriately.
It appears that systems recovery has been completed, but business recovery continues…
Toibudeen Oduniyi
@tobydeen
https://www.linkedin.com/in/deeno/