IT failures are more common than we think. Admitting you have a problem signals that doing business with you is risky, and that leads to customer churn. So most companies conceal their failures and pretend nothing is wrong. Ultimately, it is the clients who suffer.

For banks and financial institutions, such problems are particularly unsettling. IT disruptions lock customers out of their accounts, leaving them unable to pay for food, rent or petrol. Not only do the institutions lose money to customer churn and damage control, but they are sometimes fined for their IT shortcomings.

Still, the number of IT failures is on the rise. This is an old, well-known problem with no easy fix: banks are running legacy systems that are 30-40 years old. Not only were those systems never built for today’s network challenges, but several new responsibilities have been bolted on top of the stack, such as ATMs, online banking and mobile banking. These new functions are written by different teams in different programming languages, which adds to the complexity. As a result, few people fully understand the entire system.

At the request of the British Treasury Committee, some banks have started publishing their IT failure figures, which show the institutions suffering well over one outage per month. Barclays, which had the highest number of incidents, reported 41 cases in nine months.

But financial institutions are reluctant to upgrade their systems. Not only would this be costly, but it also carries great risk. After all, the old systems have already worked for several decades, while new systems are not as battle-tested. Furthermore, the upgrades and migrations can become a source of problems themselves, as we witnessed in the case of TSB.

For that reason, we’re witnessing a deadlock. An upgrade might make the system fail now, while not upgrading means it will almost certainly fail later. Is there a safe way to prevent today’s IT failures without choosing between bad and worse?

A touch of AI

“Problems” are tricky: they never tell you in advance when they are coming. No one can anticipate them, so no one knows how many resources should be allocated to prevent them.

Banks used to run their processing in batch mode in the middle of the night, so any problems were cleared before the working day started. But today’s load causes systems to go wrong during the day, and the media quickly spreads the word.

Keeping versatile IT teams that can quickly address and resolve problems has long been the standard method. But naturally, there are no dedicated “problem teams,” which means the IT team must stop working on a feature and concentrate on solving the problem instead.

IT operations are equipped with dashboards and analytical tools that report a system’s health and alert the user when a certain threshold is passed. Such systems suffer from two problems: their warnings are inaccurate, since an alert does not necessarily mean a problem, and they are built around finding and resolving problems as they occur. Artificial Intelligence, however, offers a different path.

Artificial Intelligence programs are, in effect, self-programmed software. Unlike traditional software, developers do not code their behavior by hand – instead, they “train” them by feeding the program huge amounts of data and running algorithms that detect subtle patterns in it, invisible to the human eye. This is what makes AI seem intelligent and has enabled it to outperform humans on several occasions.
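As a toy illustration of training versus hand-coding: the sketch below feeds synthetic “normal” history into scikit-learn’s IsolationForest, which is just one of many pattern detectors, not the method any particular bank uses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic history of normal behavior: pairs of (CPU %, online users).
normal = rng.normal(loc=[50.0, 2000.0], scale=[10.0, 300.0], size=(1000, 2))

# No rules are written by hand: the model learns the normal pattern from data.
model = IsolationForest(random_state=0).fit(normal)

# Predict returns 1 for samples that fit the learned pattern, -1 for outliers.
# (55% CPU, 2100 users) resembles the history; (95% CPU, 10 users) does not.
print(model.predict([[55.0, 2100.0], [95.0, 10.0]]))
```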

This pattern-recognition capability makes AI the ideal solution for fighting IT failures, as it can detect the patterns that would lead to failure.

In simple terms, because AI relies on big data, it can form a dynamic baseline. Instead of an operator manually picking a threshold, the system chooses the threshold itself and constantly adjusts it while considering several other parameters. High CPU use, for instance, might be normal behavior when there are many online users, while a much smaller load can prove unusual if there are no users on the system.
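A minimal sketch of such a dynamic baseline, assuming CPU use is judged in the context of user load; the bucket size, the 3-sigma band and the 30-sample minimum are illustrative choices, not a prescribed method:

```python
from collections import defaultdict
from statistics import mean, stdev

history = defaultdict(list)  # context bucket -> observed CPU samples

def bucket(users: int) -> int:
    # Coarse user-load context; granularity is an illustrative choice.
    return users // 1000

def observe(users: int, cpu: float) -> None:
    history[bucket(users)].append(cpu)

def is_anomalous(users: int, cpu: float, k: float = 3.0) -> bool:
    samples = history[bucket(users)]
    if len(samples) < 30:  # too little data to judge this context yet
        return False
    mu, sigma = mean(samples), stdev(samples)
    # Flag values outside the band learned for THIS user-load context,
    # so 85% CPU with 10,000 users can be normal while 30% with 5 users is not.
    return abs(cpu - mu) > k * max(sigma, 1e-9)
```

The point is not the statistics but the shift: the threshold is learned and contextual rather than hand-picked.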

The dynamic baseline model has two major advantages. First, it can pinpoint the exact failing system and component, telling us where the root cause of an issue lies. Second, it can warn about issues, and prevent them, before they turn into problems.

With diagnostic tools, there is no way to tell whether an alert actually indicates a problem, because they only report the current state. AI, by contrast, can take the entire history of a metric, build its trajectory and tell us well in advance that a problem is pending.
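A minimal sketch of that trajectory idea, assuming a simple linear trend on a hypothetical disk-usage metric; a real deployment would use richer models that handle seasonality, and the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
hours = np.arange(48.0)  # two days of hourly samples
# Synthetic metric: a disk slowly filling at ~0.5 percentage points per hour.
disk_pct = 60.0 + 0.5 * hours + rng.normal(0, 0.3, 48)

# Fit a least-squares linear trend and project when usage hits 100%.
slope, intercept = np.polyfit(hours, disk_pct, 1)
if slope > 0:
    hours_to_full = (100.0 - disk_pct[-1]) / slope
    print(f"Disk projected to fill in ~{hours_to_full:.0f} hours")
```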

It is much easier to prevent problems before they happen: firefighting means the damage has already been done, and considerable resources must be spent remedying it and rebuilding customer trust. And since the issue is being addressed at a stage where no damage is done yet, we can even equip the system with automated scripts that resolve the problem well in advance. The system can “self-heal.”
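Continuing the hypothetical disk example above, a self-healing hook might look roughly like this; the logrotate command is purely illustrative, and a real runbook would be service-specific:

```python
import subprocess

def remediate_disk_pressure() -> None:
    # Illustrative remediation only: force a log rotation to reclaim space.
    subprocess.run(["logrotate", "--force", "/etc/logrotate.conf"], check=True)

def self_heal(hours_to_full: float, act_below: float = 24.0) -> None:
    # Act roughly a day before the projected failure, instead of paging a human.
    if hours_to_full < act_below:
        remediate_disk_pressure()
```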

Using AI in this context offers several advantages, especially on complex networks. IT failures are often blamed on insufficient hardware resources, yet the problem keeps popping up no matter how much RAM or how many CPU cores are added. Until the actual problem is detected, unnecessary cost keeps piling onto the business.

Since AI tracks the history of all components, it can find pattern deviations far more efficiently than a human operator and point out the exact failure. In one case, for instance, we detected that adding extra CPU cores and scaling a system vertically was less effective than scaling it horizontally, since the system also had to handle a large number of TCP/IP requests. Without that root cause detected, an IT failure would have been imminent, while management lived under the false impression that the extra CPU cores had solved the problem.

We do have old problems in the financial industry. But perhaps new solutions can solve them.
