The digitization of services and commerce means that companies of all types must now provide services 24 hours a day, 7 days a week. It is therefore necessary to identify the nerve centers of the information system, as well as the potential Achilles heels of the IT infrastructure. The costs of a malfunction or an exploited vulnerability regularly make the headlines: banks, retail brands, public institutions, service providers and other large organizations attract the wrath of their customers when their services become unavailable after an incident. Managers are therefore perfectly aware that a company's success also rests on the resilience of its systems.
Today, key indicators are defined to ensure that data is always available, reliable, secure and actionable. From databases to customer service operations, every stage of the value chain requires contingency plans. Security and resilience are becoming inherent criteria of business processes. In short, to best respond to the risk of service disruption or interruption, that risk must first be anticipated and made concrete, and teams must then be given the means to deal with it quickly and efficiently.
However, while the decision to build resilience in this way is legitimate and laudable, it sometimes misses an important element: the intricacies of how IT systems and infrastructure actually work. It can then prove counter-productive. To understand why, we need to look briefly at the history of computing as a discipline.
The rise of specialization in IT
Looking back, it is striking how quickly and deeply IT has become a specialized professional field. Only two generations ago, information technology offered almost no career opportunities. Today, it is not only one of the world's main areas of employment, but workers in the sector are expected to take on roles and develop key, sometimes highly specific, skills in order to meet demand.
However minor and well integrated a technology may seem, entire teams focus on it day to day. The notion of quality of service does not mean the same thing to a help desk employee as to an IT engineer. The notion of speed does not mean the same thing in an ordinary office building as in a high-performance computing centre. And security does not have the same implications for a network specialist as for a cryptography expert. Yet, from a business strategy perspective, these requirements are often treated as identical and interchangeable. The same confusion between the general and the particular also affects how a company's resilience is viewed.
When a disruptive event occurs, an IT professional's first reaction is not to ask how it affects the business, but how it relates to their own definition of resilience, which they have framed as a measure of service quality. Rather than identifying the root of the problem, each team takes care of its own area of responsibility, whether that is restoring database availability, maintaining order fulfillment, or protecting internal communication channels.
While this division of responsibilities has some advantages in emergency situations, it also means the various IT departments take longer to resolve an incident. And taking longer means, of course, suffering greater financial and reputational damage.
A new approach to operational resilience
This is not meant to question the motivation or rigor of IT operations professionals, nor is it a call to roll back the specialization of computing: the problem is broader and more structural.
IT professionals have specialized over the decades, and so has the way we sell, buy and connect IT infrastructure. To optimize management, budgets are allocated in a segmented fashion, and suppliers package their offerings the same way. This comes at the expense of a global, coherent view of how exploited vulnerabilities and flaws affect the company: it encourages a focus on specific metrics such as SLAs, or on narrow goals and outcomes, rather than on why the technology is needed in the first place. To take a simple example, if an incident occurs, such as a severed link to a data center, the inaccessibility of the information itself is anecdotal compared with customers being unable to place an order.
We therefore need to rethink how operational resilience is managed and measured across these disciplines. Service level agreements (SLAs) are reassuring on paper, because each SLA is specific to the technology it measures, but they do not necessarily reflect the consequences in real situations. Instead, we need to start from the "Minimum Viable Company" (MVC), that is, the definition of a company's essential, even vital, processes, functions and services, and ask how the entire information system can support them.
This change of approach introduces a certain complexity. It is first necessary to understand clearly what constitutes the MVC for a given organization; then to understand which systems and infrastructure support those activities (a step many companies omit); and finally to test what impact each vulnerability or flaw would have on the whole organization (which many companies do not yet do). At the highest level, a holistic view of resilience becomes paramount. Only this overview makes it possible to identify what is truly essential, even vital, for the company, and to implement its resilience effectively and operationally.
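The three steps above can be sketched as a toy dependency map. This is a minimal illustration only; the function and system names (order intake, data-center link, and so on) are hypothetical stand-ins, not a prescribed model:

```python
# Step 1: define the Minimum Viable Company as the set of vital functions.
MVC_FUNCTIONS = {"order_intake", "payment_processing", "customer_support"}

# Step 2: map each vital function to the systems it depends on
# (hypothetical names for illustration).
DEPENDENCIES = {
    "order_intake": {"web_frontend", "orders_db", "dc_link"},
    "payment_processing": {"payment_gateway", "dc_link"},
    "customer_support": {"ticketing_system"},
}

# Step 3: for a given failed component, assess which vital functions
# are affected across the whole organization.
def impacted_functions(failed_component: str) -> set:
    """Return the MVC functions that depend on the failed component."""
    return {
        func
        for func, systems in DEPENDENCIES.items()
        if func in MVC_FUNCTIONS and failed_component in systems
    }

# A break in the data-center link hits both order intake and payments,
# even if the per-technology SLA for that link looks like a minor metric.
print(sorted(impacted_functions("dc_link")))
```

Starting from the impact on vital functions, rather than from each technology's own SLA, is what turns a component-level incident into a business-level answer.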