When automation met rocket science at BT.
Neil J. McRae describes himself as a “space and technology nut”. As Annie Turner discovers, this drives both his personal and professional life.
McRae, MD Architecture & Strategy and Group Chief Architect at BT, will have been with the operator for ten years this month. He was named after astronaut Neil Armstrong and is a hugely enthusiastic gamer, online and off – he owns 20 pinball machines. Before the pandemic, he hosted tournaments in his gaming room, from where he did this interview. He believes profoundly that technology can transform lives.
McRae says BT has been working on network automation “forever”. He explains that the firm “invested a lot in what we historically called OSS, with the goal of reducing operating costs. I would argue it wasn’t very successful and we needed to change our approach entirely”.
Let the Games begin
That change began in 2011 as London was to host the Summer Olympic Games the following year and BT was responsible for all the networking. McRae says, “Our starting point for the strategy was that the 100 meters sprint lasts 10 seconds. The historical way of running network automation or network management meant…you poll the network and do it again 15 minutes later” – which clearly would not be much help if anything went wrong.
McRae explains, “We were trying to figure out how to change the network management paradigm…then if we had a problem on our core network, it could take 30 or 40 minutes to figure out where it was or what was causing it.” He adds, “We needed automation because humans can’t react quickly enough”.
The Olympics aside, the issue of automation was increasingly acute as telecoms were becoming integral to many functions where failure could endanger life, such as signalling for trains and systems for hospitals and the emergency services. McRae notes, “It turned out some other organisations were looking at this in the same way, and we stumbled on each other in this process”.
Houston, we have a problem
The first “accident” was that McRae happened to be at NASA in Houston – BT runs networks for the US government and public sector. He was allowed into one of the historic Apollo control centres which happened to be simulating what had happened to Apollo 13.
McRae comments, “I have seen that film [Apollo 13 with Tom Hanks] hundreds of times and met two of the astronauts, but I was thinking that they were able to get those people back – because of all those systems that fed a continual stream of data about what was happening into a mainframe.
“I thought, imagine if we did that with a network: if every device sent us a real-time stream about its status. We could take those feeds, put them into a data lake and analyse the data, in real time, or historically. The devices tell us they’re OK until something changes – that’s streaming telemetry.”
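The contrast between 15-minute polling and an on-change stream can be illustrated with a minimal Python sketch. It is not BT's implementation: the `Sample` record, the device name `core-1` and the simulated 10-second fault are all hypothetical, chosen only to show why a poller can miss a short event that a stream reports immediately.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Sample:
    t: int        # seconds since the start of the simulation (hypothetical clock)
    device: str   # hypothetical device name
    status: str   # e.g. "ok" or "link-down"


def on_change(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Yield a sample only when a device's status differs from its previous one."""
    last: Dict[str, str] = {}
    for s in samples:
        if last.get(s.device) != s.status:
            last[s.device] = s.status
            yield s


def polled(samples: Iterable[Sample], interval: int) -> List[Sample]:
    """Return what a poller would see: only samples taken at multiples of `interval`."""
    return [s for s in samples if s.t % interval == 0]


def status_at(t: int) -> str:
    # Simulated fault lasting 10 seconds, from t=5 up to (not including) t=15.
    return "link-down" if 5 <= t < 15 else "ok"


# One reading per second for 30 seconds from a single simulated device.
raw = [Sample(t, "core-1", status_at(t)) for t in range(30)]

events = list(on_change(raw))         # the stream reports each change as it happens
poll_view = polled(raw, interval=900)  # a 15-minute poller only sees t=0

for e in events:
    print(f"t={e.t:>2}s {e.device} -> {e.status}")
print("poller saw:", [s.status for s in poll_view])
```

In this simulation the stream carries three messages (the initial state, the fault and the recovery), while the 15-minute poller sees only the healthy reading at the start and never learns the fault occurred.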
He continues, “The next stop was to automate that input with some big data technology, and now, increasingly, with AI [by which McRae means machine learning]. We’ve been working on [applying AI] since 2014.”
What would Google do?
That work received a boost in 2015, when BT found out that Google was looking at almost identical AI technology, so the two worked together to standardise it for use across telco devices, applications and servers.
Although AI is still in its infancy in ops, BT uses it to help plan and reconfigure the network. Hence rather than configuring the network to send traffic via alternative routes to avoid congestion, BT builds a model of the network and the characteristics it wants the network to have. The automation generates the appropriate configurations then the network makes the changes.
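As a rough illustration of that model-driven approach, the Python sketch below takes a description of the network (link capacities), a desired characteristic (no link above 80% utilisation) and a set of traffic demands, then generates the routing configuration. Everything in it is an assumption made for the example: the link names, the demands, the 80% threshold and the simple first-fit placement are hypothetical and are not drawn from BT's tooling.

```python
from typing import Dict, List, Tuple

# (demand name, bandwidth in Mbit/s, candidate paths in order of preference)
Demand = Tuple[str, float, List[List[str]]]


def generate_config(
    capacity: Dict[str, float],
    demands: List[Demand],
    max_util: float,
) -> Dict[str, List[str]]:
    """Pick, for each demand, the first candidate path that keeps every link
    on that path at or below `max_util` of its capacity (first-fit placement)."""
    load = {link: 0.0 for link in capacity}
    config: Dict[str, List[str]] = {}
    for name, mbps, paths in demands:
        for path in paths:
            if all(load[l] + mbps <= max_util * capacity[l] for l in path):
                for l in path:
                    load[l] += mbps
                config[name] = path
                break
        else:
            # No candidate path can carry this demand within the stated limit.
            raise ValueError(f"no path satisfies the intent for {name}")
    return config


# Hypothetical three-link network, capacities in Mbit/s.
capacity = {"A-B": 100.0, "A-C": 100.0, "C-B": 100.0}
demands = [
    ("video", 60.0, [["A-B"], ["A-C", "C-B"]]),
    ("backup", 50.0, [["A-B"], ["A-C", "C-B"]]),
]

config = generate_config(capacity, demands, max_util=0.8)
for name, path in config.items():
    print(name, "->", " / ".join(path))
```

Here the operator states the outcome it wants rather than the routes: the second demand is steered onto the longer path only because placing it on the direct link would break the utilisation limit.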
Simulating scenarios, avoiding surprises
The COVID-19 outbreak was an unexpected and massive stress test: overnight millions moved to working from home (WFH) and BT itself had thousands of people who needed VPNs to WFH securely.
However, the company had already simulated such scenarios and was confident it had the right capacity in the right places because, using both real-time and historical data, it had been able to build accurate pictures of what would happen. McRae said BT had been “very confident” that the network would perform, which it did. “Automation helped us with many of the activities we had to do as a telco to cope with what was a massive change in the profile of networks – probably the biggest I’ve ever seen,” he added.
He says BT is ahead of the game in some areas and has shared expertise with the industry, “because the more vendors and people that support it, the more we can get out of it”. For instance, BT developed some streaming capabilities itself, but also used off-the-shelf software to help with some aspects of the analytics.
Open source rises
Open source has become important too, with BT using some Kafka-based capabilities it developed. Kafka is an open-source stream-processing platform, written in Scala and Java, that was originally created at LinkedIn and is now developed under the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
BT’s engineers are moving from a world that was predominantly about configuration to one that is increasingly about writing code. Although many could write code to start with, McRae has sponsored some to develop and hone their skills – he is a programmer by trade – as BT moves towards that goal of having entirely programmable networks.
He stresses that another key driver towards automation at the start, ten years ago, was that, “Then we had maybe three or four network protocols for different types of networking solutions. Today, it’s more like 40, or 50. Humans can’t cope with that level of complexity.”
Ultimately, AI and automation will make the network centres redundant, and the network experts will be training the AI rather than configuring the infrastructure. McRae says that although some people are scared AI will replace their jobs, “[Our engineers] realised that the status quo couldn’t work…they themselves started asking for the tools to help them run it all better.”
Part of that is transitioning services to cloud native. The first service to become cloud native was BT’s TV platform, which is “all programmable”. As a greenfield, rather than legacy, application, it was relatively straightforward for a first attempt.
McRae says, “We’ve APIs to add channels, move channels, add content, remove content, and over time, more and more telco capabilities will be delivered in that way. We’re really pushing cloud native as the right technology for the telco of the future.”
He adds, “With our wonderful colleagues at Ericsson, we are building a 5G cloud-native core network that we will turn on probably later this year. It’s running in trials and working great, but you make sure it’s 120% right before putting customers on it”.
McRae states, “The reason we want cloud native is we see requirements for 5G in our core data centres, but also at the edge as we roll out new services, like virtual reality.” BT is also looking “to place one of our edge cloud-native nodes inside a business customer to help them automate their businesses or designs or whatever activity they do where the complexity requires it to be automated”, he says.
The journey towards cloud native is an uneven one, depending on the starting point for any given service and the desired outcomes. McRae says, “A big question is our legacy telco network – the PSTN. Our new IP voice platform is virtualised – cloud enabled, but eventually will be cloud native.” He reckons BT is probably about a quarter of the way through the IP voice project, but on schedule. It will deliver better quality voice and enable BT to scale and make changes more easily.
Orchestration and application-awareness
This brings McRae to another key aspect of cloud native. He says, “We partner with a lot of organisations and being able to bring those organisations into our delivery is an automation in itself: orchestration.”
For instance, customers can include Zoom in their BT communications package. He explains that while a blip during a Zoom call in many instances is not an issue, in some cases it would be, such as with 200 analysts on a banking call. McRae says, “It has to work, and clearly, to ensure people have understood correctly. We bring our high-quality voice platform to Zoom and offer it to customers in a way that’s easy for them to buy and use in their business.”
McRae says, “Our goal is to use automation to become much more user- and application-centric. So the network realises it’s a Zoom call and optimises the network in one way, and in another when the kids in the house fire up the Xbox at the same time. So the Zoom call and Fortnite are both as good as they can be.
“Over time, that will become essential. Possibly there’ll be some discussion about net neutrality, but it won’t affect us. Net neutrality is about throttling one thing to provide better service for another and that’s not what we’re doing here. We are looking to give everything what it needs to perform”.