5 Questions for Scott Richmond and Jon-Paul Herron on Embracing Network Automation at ESnet
By Amber Rasche - Senior Communications Specialist, Internet2
Estimated reading time: 10 minutes
Let’s make network automation a priority in 2023. We’ve heard the community loud and clear: It’s time to move the conversation from “if” to “when and how” the research and education (R&E) community can embrace automation for the betterment of all.
In this “Embracing Network Automation” blog series, we’re gathering insights from several R&E network organizations that are paving the way for progress in this space.
Scott Richmond is the group lead for the Orchestration and Core Data team and Jon-Paul Herron heads Network Services at the Energy Sciences Network (ESnet), which operates a high-performance network built to support and facilitate big-data scientific research across America and worldwide.
Funded by the U.S. Department of Energy’s (DOE) Office of Science and managed by Lawrence Berkeley National Laboratory, ESnet connects the DOE national laboratories, supercomputing facilities, and major scientific instruments, as well as additional research and commercial networks, to enable global collaboration on the world’s biggest scientific challenges.
In this Q&A, Scott discusses how ESnet got started on its network automation journey, while Jon-Paul shares what’s next for ESnet in the automation space. Together they unpack some of the biggest wins and hardest lessons learned along the way and offer advice on how others in the R&E community can embrace network automation to benefit their organizations and the people they serve.
When did ESnet first embark on its network automation journey?
Scott Richmond: ESnet began looking at service orchestration and automation at the start of the design for ESnet6 in 2017, with a vision for self-service for provisioning and management of network services to increase availability and reliability. (ESnet6 is the latest generation of ESnet’s high-performance network, unveiled in October 2022.)
Historically, managing the network meant individual network operators logging into routers and making changes by hand. Early ESnet was small enough that many of our engineers could keep its design and configurations in their heads. Through many years of evolution, a large portion of our configurations remained more customized than standardized. With the increased scope and scale of ESnet6, such customization meant significant additional effort and an increased likelihood of misconfigurations. “Boutique networking” makes troubleshooting at scale very difficult. So, to enable self-service, we needed to be very clear about what kinds of services ESnet provided.
We realized immediately that we needed to standardize the service offerings that ESnet provided with an eye toward reducing the complexity of available options. This meant spending time analyzing our current offerings and design, talking with our existing customers, and ensuring that we came up with options that were both consistent and met existing needs. Since this was not a fully greenfield deployment, most of our existing site connections would have to be migrated to the new network. For many sites, this meant starting early to engage with them to help migrate legacy configurations to a model that could easily conform to the new service offerings. Before we could even begin automating, we had to standardize configurations and reduce complexity – a multi-year effort that involved everyone in our network operations center, Network Services Department, and Systems and Software Department.
What are some of the biggest wins ESnet has had in the network automation space and how are those successes benefiting the community you serve?
Scott Richmond: Since April 2021, we’ve developed 11 network service offerings and 66 distinct automated workflows. We’ve created 3,144 instances of those services and executed the distinct workflows 9,704 times to create or modify the network configuration and update various business systems. Previously, to make a network change, an engineer would have had to log in to one or more devices or systems and spend several hours or even days. For each of those nearly 10,000 workflow runs, all an engineer had to do was click a button to launch an automated task that ran in less than five minutes. That’s huge.
The biggest benefit from this is consistency, not only in our network configuration, but also in our documentation, business processes, source-of-truth database, and the language we use across ESnet. A common perception about automation is that it’s primarily about saving time – and yes, you will save some time. However, that’s largely a byproduct of the main benefit: consistency of process. The automation will always do the same thing every time. It doesn’t get tired and make typos or other mistakes. That consistency leads to improved network reliability, stemming from improvements in the mean time to repair, requirements gathering, and prioritization. When everything follows standard patterns and processes, it’s much easier to track down and fix network errors.
And there are fewer errors in general when you eliminate hand configuration. Automation also improves efficiency in engineer workloads. A complex hand configuration that might have taken several days to verify can now be completed in minutes with a high degree of accuracy. This frees up our team to engineer new and interesting services.
To get here, we leaned on the expertise of our international partners who had already made advances in this space. SURF in particular was very generous in open sourcing their Workflow Orchestrator, and NORDUnet provided valuable insights into using Cisco Network Services Orchestrator. Internally, we needed to grow a software development team that would collaborate closely with the network engineering team, which really understood the operations of the network. Network engineering and software engineering are two different but complementary disciplines, and both were key to the success of our automation efforts. One of the other important steps was the adoption of Agile and Scrum for building tooling. Much of what we built was software-focused, so using industry best practices to iterate quickly on design allowed us to prototype and deliver our first automation workflows in record time.
Jon-Paul Herron: For our scientific users, many of the benefits resulting from ESnet’s automation journey are subtle. Changes for users can now be made faster. Outages are fixed more easily because we can make good assumptions about the configuration. The network is more secure. Engineers can focus more of their attention on the capabilities of our services and innovation, rather than on router commands. Other benefits are indirect: Automation has forced us to become even more rigorous about service management. While this ensures we have the standardization that automation requires, it also means we’ve become much more intentional about aligning our services to the needs of our unique user community. In the future, our users will start to see much more direct benefits as we move toward our self-service vision, which will allow ESnet services to become much more integrated into the automated scientific workflows of our users.
What are two of the hardest lessons ESnet has learned about network automation along the way?
Scott Richmond: While ESnet faced many technological challenges, the biggest challenge was making a cultural paradigm shift. As mentioned, the real gain from service orchestration is increased consistency and accuracy of business processes, documentation, and configuration. To achieve this, there has to be a large shift in the network engineering model: you move away from hand configuration and toward modeling standardized services. Functionally, this has meant our engineers needed to think less about specific CLI, or Command Line Interface, commands and focus more on the capabilities and technologies we intend to deliver with the network as a whole. Moving from boutique networking to automated service offerings can actually reduce speed and efficiency at the beginning while people adjust to new ways of thinking and working.
The lesson for us at ESnet was that all the work to standardize processes and define services takes considerable time and effort – much more than we had initially thought it would. We also had to recognize the importance of the entire lifecycle of network configuration. We couldn’t just place initial configuration on routers without consideration for the longer-term operations and management of the network. Simpler automation templates and scripts may at first offer speed and efficiency, but they make accuracy and maintainability more difficult later on. Learning this took time, as the up-front cost of fully orchestrating was not something we had accounted for in our project timeline. We still managed to launch ESnet6 two years ahead of schedule and well under budget. The lesson we’d share is to prioritize modeling your service offerings, conduct proper requirements gathering, and carefully define your standardized business processes.
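To make the contrast between one-off templates and modeled services concrete, here is a minimal Python sketch. It is not ESnet's tooling or the API of any orchestrator named in this post. The `ConnectionService` class, its fields, the bandwidth options, and the rendered syntax are all hypothetical, chosen only to illustrate the idea that a service is a small, validated model and that later lifecycle changes pass through the same validation as the initial deployment.

```python
from dataclasses import dataclass, replace

# Hypothetical catalog of standardized options (not ESnet's actual offerings).
ALLOWED_BANDWIDTH_GBPS = (10, 100, 400)


@dataclass(frozen=True)
class ConnectionService:
    """A hypothetical standardized service: a few validated fields, no free-form CLI."""
    customer: str
    vlan: int
    bandwidth_gbps: int

    def __post_init__(self):
        # Reject anything outside the standard offering instead of allowing exceptions.
        if not self.customer:
            raise ValueError("customer is required")
        if not 2 <= self.vlan <= 4094:
            raise ValueError(f"vlan {self.vlan} is outside 2-4094")
        if self.bandwidth_gbps not in ALLOWED_BANDWIDTH_GBPS:
            raise ValueError(f"bandwidth {self.bandwidth_gbps} is not a standard option")


def render_config(service):
    """Render the model to config lines; the same model always yields the same lines.

    The output syntax is illustrative only, not a real router dialect.
    """
    return [
        f"service {service.customer}-vlan{service.vlan}",
        f"  vlan-id {service.vlan}",
        f"  bandwidth {service.bandwidth_gbps}g",
    ]


def modify(service, **changes):
    """Lifecycle step: build a changed copy, which re-runs the validation above."""
    return replace(service, **changes)
```

Because `modify` goes back through the model, a later change request for a non-standard value fails the same way it would have at creation time, which is the maintainability property a quick template or script tends to lack.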
What’s next for ESnet in the network automation space?
Jon-Paul Herron: The work to standardize business processes is never finished, as we’re constantly creating new products and services to advance the network. As we continue to focus on creating new and innovative services, we’re letting automation handle the regular care and maintenance of daily network operations more and more.
The next major phase of our automation journey is focused on user self-service. Our goal is to make every ESnet service configurable directly by users throughout the service lifecycle. To that end, we are focusing on adding more operations and management functionality to our service offerings, to support automating the full lifecycle of the services. We’re also working on efforts that ensure we properly monitor and measure services as they’re implemented through automation.
What advice would you give to peers across the R&E community on how to embrace network automation at their organizations?
Jon-Paul Herron: It’s easy to see a richly featured, fully automated system and want to jump right to that. But I would advise anyone going down this path to recognize that it is not a one-time project. The network automation journey will take a long time. No matter how much planning and research is done, it won’t be right the first time.
First, don’t neglect the need to think about your business, services, and workflows. Automation is only worth the investment if it means better services for your users. Make sure these are well documented, with clear requirements.
Scott Richmond: Second, identify the part of your network that is the most well-defined and standardized today, with the fewest exceptions — and start orchestrating. It will be slow at first, so make sure to build in the time that it will take to do this. Additionally, if possible, automate your existing services first, then migrate to new services. If you have the option, don’t automate as part of a network upgrade. One of the decisions ESnet made was that our automation would be a go-forward approach, only automating new services. We left our legacy services non-automated. This unfortunately meant that we were trying to design and build the tooling needed to deploy our services at the same time as the new network was being deployed. The analogy everyone was using was that we were trying to deconstruct and rebuild a plane full of passengers while flying it.
Jon-Paul Herron: As Scott recommends, take a highly iterative approach. No matter how much planning you do, it still takes experimentation and learning by trying. Jump in and start automating in some useful part of your system, and start learning what works and what doesn’t for your specific network.
Scott Richmond: Finally, be prepared to put in the investment to build trust in the tooling. Moving from manual, custom configuration to orchestration and automation only works when people trust the system and process to work consistently every time, without error. Ensuring that you put in the time and effort to test your automation thoroughly, and to consider the interactions of the different services, will save you time and effort in the long run.
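As one illustration of what testing automation thoroughly can look like, the Python sketch below checks two properties that help build trust: running a workflow step twice gives the same result as running it once, and two services do not silently claim the same resource. The function names, the tuple layout, and the port and VLAN example are assumptions made for this sketch, not a description of ESnet's test suite.

```python
def apply_service(device_config, service_lines):
    """Hypothetical workflow step: add any missing lines, leaving existing ones alone."""
    result = list(device_config)
    for line in service_lines:
        if line not in result:
            result.append(line)
    return result


def find_vlan_conflicts(services):
    """Return the (port, vlan) pairs claimed by more than one service.

    `services` is an iterable of (service_name, port, vlan) tuples.
    """
    first_owner = {}
    conflicts = set()
    for name, port, vlan in services:
        key = (port, vlan)
        if key in first_owner and first_owner[key] != name:
            conflicts.add(key)
        first_owner.setdefault(key, name)
    return sorted(conflicts)


def check_idempotent(device_config, service_lines):
    """True if applying the same service twice equals applying it once."""
    once = apply_service(device_config, service_lines)
    twice = apply_service(once, service_lines)
    return once == twice
```

Checks like these are cheap to run before every workflow change, and they target the two concerns raised above: consistent behavior every time and the interactions between different services.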