Contributed by Amir Kupervas, Managing Director, Anodot.
For CSPs running extremely complex systems, fully autonomous monitoring is the holy grail. As monitoring platforms mature, there is a growing expectation that they will move from anomaly detection to full remediation. This is not run-of-the-mill industry buzz. Over the last five years, monitoring has evolved to the point where autonomous remediation (aka “the action phase”) is the logical next step, and it is likely to become a dominant feature for leading CSPs. But to get there, robust machine learning capabilities are key.
Why siloed, threshold-based monitoring falls short
To ensure availability and reliability, CSPs need to stay on top of thousands of metrics. Today, a typical network is monitored in silos: every network layer is monitored separately and every network type differently, using rule-based or static thresholds. This siloed approach prevents effective correlation between related issues.
Threshold-based monitoring gives rise to billions of alarms with a very high rate of false positives, since it relies on manually set thresholds for a system that is too complex and volatile to adhere to predetermined states. Worse, static monitoring leads to late detection of service degradation and incidents. Even after detection, there is no context to go on for expedited resolution.
The three pillars of zero touch
AI enables the transformation towards automation and intelligent operations through three crucial steps that can only be achieved by applying cutting-edge machine learning: anomaly detection, anomaly correlation with root cause analysis, and, finally, remediation.
Anomaly detection. In the first stage, ML enables real-time monitoring of 100% of the network data from devices, radio networks, current and legacy core networks, services, transport, IT operations and any other source and layer. Leading monitoring platforms feature fully autonomous baselining that accounts for different seasonalities and continuously adapts to change. By monitoring the full scope of data using adaptable algorithms, anomalies are detected faster and false alarms are reduced to a minimum.
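To make the contrast with static thresholds concrete, here is a minimal sketch of seasonal baselining, assuming a single metric sampled at a fixed interval with a known seasonal period. The function names, the toy series and the choice of a simple mean and standard deviation per seasonal slot are illustrative assumptions, not a description of any vendor's algorithm.

```python
from statistics import mean, stdev

def static_threshold(values, limit):
    """Flag every point above a fixed limit, regardless of context."""
    return [i for i, v in enumerate(values) if v > limit]

def seasonal_anomalies(values, period, k=3.0, min_sigma=0.5):
    """Flag points that deviate more than k standard deviations from the
    baseline of their own seasonal slot (for example, the same hour of day).

    The baseline for each point is built from the other points in the same
    slot, so a regular daily peak is treated as normal behavior.
    min_sigma is a hypothetical floor that avoids flagging tiny deviations
    when a slot happens to be almost constant.
    """
    flagged = []
    for i, v in enumerate(values):
        peers = [values[j] for j in range(i % period, len(values), period) if j != i]
        if len(peers) < 2:
            continue  # not enough history to form a baseline
        mu, sigma = mean(peers), max(stdev(peers), min_sigma)
        if abs(v - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Toy series: a repeating four-step pattern with small variation,
# plus one injected spike in a normally quiet slot (index 13).
pattern = [10, 50, 90, 50]
series = [pattern[i % 4] + ((i // 4) % 3) - 1 for i in range(24)]
series[13] = 90

static_hits = static_threshold(series, limit=80)
seasonal_hits = seasonal_anomalies(series, period=4)
```

In this toy data the fixed limit fires on every regular peak as well as on the spike, while the seasonal baseline isolates the spike alone. A production system would also need to learn the period and handle trend and multiple seasonalities, which this sketch does not attempt.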
Cross-silo correlations for quick root cause analysis. One of ML’s superpowers is its ability to correlate across billions of metrics. When such a technology is unleashed, it autonomously prioritizes and correlates related events and glitches across multi-layered and multi-vendor networks. These correlations provide the full context of what is happening, enabling teams to swiftly get to the root cause of every issue for the fastest possible remediation.
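As a rough illustration of cross-silo correlation, the sketch below groups anomalies from different layers into incidents when they occur close together in time. This is a deliberately simple time-proximity heuristic standing in for ML-based correlation; the `Anomaly` fields, the `gap` parameter and the metric names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anomaly:
    metric: str   # hypothetical metric name
    layer: str    # silo the metric comes from, e.g. "ran" or "transport"
    start: int    # start time in seconds
    end: int      # end time in seconds

def group_into_incidents(anomalies, gap=60):
    """Merge anomalies whose time windows overlap, or start within `gap`
    seconds of the running incident's end, into a single incident."""
    incidents, current, current_end = [], [], None
    for a in sorted(anomalies, key=lambda a: a.start):
        if current and a.start <= current_end + gap:
            current.append(a)
            current_end = max(current_end, a.end)
        else:
            if current:
                incidents.append(current)
            current, current_end = [a], a.end
    if current:
        incidents.append(current)
    return incidents

anomalies = [
    Anomaly("packet_loss", "transport", 1000, 1300),
    Anomaly("dropped_calls", "ran", 1030, 1400),
    Anomaly("signaling_latency", "core", 1420, 1500),
    Anomaly("cpu_load", "it", 5000, 5100),
]
incidents = group_into_incidents(anomalies)
layers_in_first = {a.layer for a in incidents[0]}
```

Here the three network anomalies collapse into one incident spanning the transport, RAN and core layers, while the unrelated IT anomaly stays separate. A real correlation engine would weigh topology and metric behavior as well, rather than timing alone.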
Remediation. By autonomously pinpointing network anomalies and correlating them, ML-based monitoring is paving the way for autonomous remediation. The technological roadmap is leading towards automation rule mapping and a fully automated ML remediation engine. In this scenario, the ML-based system will go through phases 1 and 2 (anomaly detection and root cause analysis), recommend an action based on previous actions, execute the action through the remediation engine, and fine-tune its operations through a closed feedback loop, steadily improving its reactions.
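The closed loop described above (recommend, execute, learn from the outcome) can be sketched in a few lines. This is a hypothetical toy, assuming a fixed set of candidate actions and a boolean "resolved" signal; the class name, action names and the smoothed success-rate scoring are illustrative choices, not an actual remediation engine.

```python
from collections import defaultdict

class RemediationLoop:
    """Recommend the action with the best track record for a root cause,
    execute it, and feed the outcome back into the track record."""

    def __init__(self, actions):
        self.actions = list(actions)
        # (root_cause, action) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def recommend(self, root_cause):
        def score(action):
            successes, attempts = self.stats[(root_cause, action)]
            # Smoothed success rate: untried actions start at 0.5
            return (successes + 1) / (attempts + 2)
        # Ties go to the action listed first
        return max(self.actions, key=score)

    def record(self, root_cause, action, resolved):
        entry = self.stats[(root_cause, action)]
        entry[0] += int(resolved)
        entry[1] += 1

    def handle(self, root_cause, execute):
        action = self.recommend(root_cause)
        resolved = execute(action)                   # remediation engine call (stubbed)
        self.record(root_cause, action, resolved)    # closed feedback loop
        return action, resolved

# Stub executor: in this toy scenario only rerouting fixes the problem.
def fake_execute(action):
    return action == "reroute_traffic"

loop = RemediationLoop(["restart_service", "reroute_traffic"])
history = [loop.handle("link_congestion", fake_execute) for _ in range(3)]
```

After one failed attempt the loop switches to the action that works and keeps choosing it. This greedy policy never re-explores alternatives; a real system would likely need exploration, safety checks and human approval gates before executing anything on a live network.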
If autonomous remediation is the holy grail, then machine learning capabilities must be taken into consideration when evaluating autonomous network monitoring solutions.
Anodot is realizing the autonomous network vision by providing CSPs with the ability to monitor service experience. We collect all data types, at any scale, and correlate anomalies across the entire telco stack.
Our end-to-end Service Experience Monitoring Platform detects service-impacting incidents in real time, helping customers like T-Mobile and Megafon reduce the number of alerts by 90% and shorten their Time to Resolve by 30%.