Automation and Orchestration No Longer a ‘Nice to Have’ for Internet2’s Next Generation Infrastructure

By Amber Rasche - Senior Communications Specialist, Internet2


Chris Wilkinson, Internet2 Network Services Director, and Karl Newell, Internet2 Network Software Architect

Chris Wilkinson is the network services director of planning and architecture at Internet2, and Karl Newell is the network software architect at Internet2. In this Q&A, Chris and Karl discuss how software-driven automation and orchestration set Internet2’s Next Generation Infrastructure apart from any previous version of the national research and education network.

Internet2 established a new software team within Network Services about 18 months ago to support one of the key drivers of the Next Generation Infrastructure (NGI) – software-enabled automation, orchestration, and telemetry. Karl, can you tell us more about the team and its priorities?


Karl Newell: Initially, the new software team’s primary focus was on how to automate and orchestrate the network and migrate all services onto the new infrastructure. Internet2 began looking at Cisco’s Network Services Orchestrator (NSO) about two years ago. Since the software team was assembled, we’ve been focused on getting NSO up and running, modeling all of our services in NSO, and preparing for the transition to NGI.

And we’re managing everything on the new network with NSO. That includes the network infrastructure itself and all of the customer-facing services on it, such as BGP peering and Layer 2 and Layer 3 circuits.

With NSO in place as our underlying automation workhorse, we’ve now started developing the UI components, APIs, and tools that will go on top of it. That includes building a new Network Services Console, which the community will soon be able to use to interact with the network and manage their services.

The team leading these software efforts includes Mike Simpson, director of network systems and software; Mark Feit, a principal software engineer primarily dedicated to perfSONAR development; James Harr, a DevOps engineer on loan from the Internet2 security team; Jonathan Stout, a systems and software engineer with the GlobalNOC and one of our primary OESS developers; Christopher Green, a contractor helping to develop some of the UI components for NGI; and me. I’ve been serving in various software-related roles at Internet2 for about two and a half years now.

At the outset of the NGI initiative, what were you aiming to achieve with the new software layer?

Chris Wilkinson: So for context, prior to NGI, the majority of our network configurations and changes were managed by engineers… by hand. An engineer would log into a router and manually generate a config element, often by referencing the existing configurations of another device already in operation. And because those elements are fairly standardized, that typically would involve minimal testing. 

Over time, that manual process creates configuration drift and inconsistencies because an engineer might, for example, choose to alter a config element or implement only part of it. And the best engineers in the world can make mistakes when manually completing this process, so there’s that element of risk as well. 

One of the goals for NGI’s software effort was to standardize all of those config elements and provide an automation workflow so that, with some basic information, an engineer can quickly deploy a full service (not just a single config element) based on a template. The results are consistency and the ability to rapidly evolve our services as new features are added or new offerings are developed; in those cases, we simply update an existing template or generate a new one.
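The template-driven idea can be sketched in a few lines of Python. This is a minimal illustration only: the service name, config syntax, and field names below are hypothetical, not Internet2’s actual NSO service models (which are defined in YANG and rendered by NSO itself).

```python
# Hypothetical sketch of template-driven service provisioning.
# The template text and parameters are illustrative, not real NSO models.
from string import Template

# A standardized template for a Layer 2 circuit (made-up config syntax).
L2_CIRCUIT_TEMPLATE = Template(
    "interface $interface\n"
    " description L2VPN to $customer\n"
    " encapsulation dot1q $vlan\n"
    "l2vpn xconnect group $customer\n"
    " p2p circuit-$vlan\n"
)

def render_service(interface: str, customer: str, vlan: int) -> str:
    """Expand the basic inputs an engineer supplies into a full config."""
    return L2_CIRCUIT_TEMPLATE.substitute(
        interface=interface, customer=customer, vlan=vlan
    )

config = render_service("HundredGigE0/0/0/1", "example-univ", 310)
print(config)
```

Because every deployment flows through the same template, a bug fix or new feature lands in one place and every future (and re-rendered) service picks it up.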

Karl Newell: I agree. Two key benefits here are consistency and efficiency.

NSO offers improved capabilities for auditing what’s on the network at scale. We can also use it to detect any deviations by comparing a config that we actually deployed with what we intended to deploy. And when a pushed config fails, it rolls back automatically. Generally, the consistency gained can help streamline our troubleshooting efforts and save time.
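The intended-versus-deployed comparison can be illustrated with a plain text diff. NSO performs this check natively against its configuration database; the sketch below just shows the idea, using documentation-range AS numbers and addresses.

```python
# Minimal sketch of config-drift detection: diff the config we intended
# to deploy against what is actually on the device. Values are examples.
import difflib

def find_drift(intended: str, deployed: str) -> list[str]:
    """Return unified-diff lines where the deployed config deviates."""
    return list(difflib.unified_diff(
        intended.splitlines(), deployed.splitlines(),
        fromfile="intended", tofile="deployed", lineterm="",
    ))

intended = "router bgp 64496\n neighbor 192.0.2.1 remote-as 64500"
deployed = "router bgp 64496\n neighbor 192.0.2.1 remote-as 64501"

for line in find_drift(intended, deployed):
    print(line)
```

An empty diff means the device matches intent; any `-`/`+` pairs flag drift to investigate or re-push.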

Another goal we have is to make network service provisioning as seamless as possible. Many of our community members have experienced how easy it is to self-manage services in the cloud. For example, if they need a new compute instance, they just log in and spin one up. We’ve been facilitating self-service for Cloud Connect through OESS for years, but now we’re taking a much broader approach to facilitate that across all of our network services. 

We want to give researchers and network administrators direct access to view and manage their services, and we want to make it as easy – and automated – as possible. That’s where our new Network Services Console will deliver incredible value to the community.


Chris Wilkinson: As Karl was talking, it struck me that what’s unique about our approach with NGI is that we want the totality of services and the platform itself to be under the control of the software layer. Internet2 has long been deeply involved with software-defined networking, and we’ve had a suite of various tools that maintain control over parts of the network. But there’s never been this single-layer, ubiquitous control over the entire platform until now.

What does this shift to a holistically automated ecosystem mean for engineers who no longer need to do these low-level manual configurations with the new software layer in place? 

Chris Wilkinson: Automation, in some ways, pulls our staff up a layer. Their cycles can shift from rudimentary device configurations to really meaningful work with the community to improve our services at a higher level.

Karl Newell: And those benefits extend to network engineers and operators across the broader community, too. We’re leveraging general APIs on top of NSO, so in addition to using the Network Services Console to view and manage services, they can build their own suite of automation tools around ours for this edge-to-edge ecosystem. You can envision this future where a community member spins up resources in the cloud, provisions a connection on our network, monitors that connection, and then tears it all down when it’s no longer needed – all in a completely automated fashion.
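The provision–monitor–teardown lifecycle Karl describes could look something like the sketch below. This is purely illustrative: the class, method names, and payload fields are assumptions standing in for whatever API a community member would actually call, not Internet2’s real interface.

```python
# Illustrative sketch of an automated edge-to-edge workflow against a
# hypothetical provisioning API (all names and fields are made up).
from dataclasses import dataclass, field

@dataclass
class FakeNetworkAPI:
    """Stand-in for a real API client; records provisioned connections."""
    connections: dict = field(default_factory=dict)

    def provision(self, name: str, a_end: str, z_end: str, mbps: int) -> str:
        self.connections[name] = {"a": a_end, "z": z_end, "mbps": mbps, "up": True}
        return name

    def status(self, name: str) -> dict:
        return self.connections[name]

    def teardown(self, name: str) -> None:
        del self.connections[name]

api = FakeNetworkAPI()
# 1. Provision a connection from the campus to a cloud resource.
cid = api.provision("campus-to-cloud", "campus-router", "cloud-gw", mbps=1000)
# 2. Monitor it while the work runs.
assert api.status(cid)["up"]
# 3. Tear it down when it is no longer needed.
api.teardown(cid)
```

The point is that every step is an API call, so a member’s own tooling can drive the whole lifecycle without a human in the loop.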

Can you talk more about the actual process of developing the software layer and making sure it meets the team’s needs?

Karl Newell: It’s been a long road that we started down years ago as we began looking at the ecosystem of available tools. 

In the summer of 2019, we put together a “multi-domain service orchestration” proof of concept. We asked our partners in the industry: If we wanted to orchestrate services – not just within Internet2 but across our regionals and member institutions, as well –  what would you propose as the state-of-the-art for achieving that? 

Cisco responded with NSO, which we found to be the most viable option. 

We wanted to demo NSO for the community at Internet2’s TechEX event that year, so a Cisco engineer and I sat down and, basically within a day, put together a proof of concept to show we could achieve that edge-to-edge orchestration with NSO. For example, if a user needed a connection from a campus to a resource in the cloud, we demonstrated that NSO could provision across all three service providers: the campus, the regional network, and Internet2.

Ultimately, I think what’s also made this effort successful is that our software team has been working hand-in-hand with our network team – sharing what we’re working on, demonstrating what NSO can do, collaborating to identify needs, discovering new uses, and constantly getting feedback. We’re continuously refining what we do with NSO and paying close attention to what the engineers want us to do next. Our goal is to make their lives easier and, as Chris mentioned, free up their time to focus on higher-level services.

How has the software layer changed the way you approach migrations compared to the 2012 or 2016 network upgrades?

Chris Wilkinson: We’ve been able to successfully move hundreds of services in a single maintenance window using automation, compared to pre-NGI maintenance windows in the past where only a dozen services might be moved using manual provisioning. Automation also reduces engineer fatigue, as a change requires less overhead during the maintenance window itself. 

For NGI, this is achieved through templating under NSO, which concentrates the majority of the effort at the front end of the process. This allows for a high rate of deployment as well as the ability to push out bug fixes and enhancements, so we can fine-tune and accelerate the transition process over time. As a result, we achieved very high-volume migrations as the teams became progressively more comfortable with NSO and templating. We also pre-staged all of our physical changes on the platform because we physically had to move connections from one place to another – from one platform to the next – but through automation, we were able to conduct the final step of the service migrations in a way that was transparent to our community.

And the rapid acceleration through the process wouldn’t have been possible had it been done manually or even using other existing tools, which weren’t set up with that particular migration technique in mind.
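The shape of a high-volume maintenance window with automatic rollback can be sketched as follows. The function names and failure handling are hypothetical; in the real workflow, NSO itself rolls back a failed push transactionally.

```python
# Simplified sketch of a bulk migration window: push templated configs
# for many services, rolling back any that fail (names are hypothetical).
def migrate_window(services, push_config):
    """Attempt to migrate every service; collect failures separately."""
    migrated, rolled_back = [], []
    for svc in services:
        try:
            push_config(svc)            # push the pre-staged, templated config
            migrated.append(svc)
        except Exception:
            rolled_back.append(svc)     # a failed push is undone, not left partial
    return migrated, rolled_back

def push_config(svc):
    # Stand-in for the real push; one service fails for demonstration.
    if svc == "bad-service":
        raise RuntimeError("device rejected config")

ok, failed = migrate_window(["svc-1", "svc-2", "bad-service"], push_config)
print("migrated:", ok, "rolled back:", failed)
```

Because the per-service cost is near zero, moving hundreds of services per window is limited by verification, not by typing.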

As the transition to NGI is nearly complete, looking back, what lessons did you learn? What, if anything, would you do differently?

Chris Wilkinson: One of the big lessons I’ve taken away is that you can’t automate something until it’s fully realized. Automation requires consistent and high-quality input information. For example, when you edit a config by hand, you can work incrementally – but you just can’t get away with that when it comes to automation. A service has to be completely fleshed out before it’s ready to automate.

While working on NGI, we frequently found ourselves waiting for configuration input material to be ready for the automation team to then carry it forward. So that’s a lesson learned: allow more time to ensure that the input material is at 100% and is well understood up front before you get to that automation stage.

Karl Newell: Similar to what Chris is saying – like with any project, start earlier. And that’s coming from a team that really did start early! Or so we thought.

We learned a lot along the way, and we modified NSO for the better based on those lessons learned. The software team worked well with the network team on the original designs, but there’s always room for improvement with more time to iterate, test, gather feedback, and even conduct more training for the full engineering team.

Chris Wilkinson: To Karl’s points about training and timing, the NGI team had to learn fresh technologies on all layers of the network – and they had to do it in 12 months. While some of us had prior Cisco experience with our Layer 2 and Layer 3 services, this particular platform is brand new and NSO was new to most on our team.

It’s incredible that we’ve gone from beta and prototyping pre-RFP to a full production network in effectively 12 months. And the team had to learn all of the software tools, learn Cisco’s Routing Policy Language (RPL), translate that into NSO, and then debug it. And bugs are always expected, especially on a beta platform. Just recently, the team found a configuration issue that interacted with a Cisco bug, and we were able to deploy the fix very quickly using NSO instead of logging into each of the 90 routers on our network.

Internet2 effectively doubled the number of routers on the network with NGI, and those devices are also configured with much more complexity and resiliency. Now every city has multiple connections, and within every city there are multiple connections to multiple adjacent routers. 

As the infrastructure has expanded in scale and capabilities, automation and orchestration are no longer just nice to have – they really are a must-have. 


Now that the transition to NGI is mostly complete, where will you focus your efforts next? How can others in the community get involved to support those next steps?

Karl Newell: As I mentioned, the software team is starting to shift our effort toward developing the new Network Services Console, which is the UI that community members can use to manage their network services and gain visibility into monitoring and telemetry. Christopher Green has been hosting interactive design sessions with the community, asking them to test UI mock-ups and functional systems to solve realistic problems and then gathering their feedback to make improvements. In the coming months and into next year, we’ll also work to expose more of our services through the console to provide members with more direct access and control.

Chris Wilkinson: There are always ways to improve your operations and consistency, so in the coming months we’ll work to uncover parts of the config that we can further optimize and improve – from the high-level service layer down into the RPL. We may find that some of the ways we built services aren’t optimal or the community’s requirements for a service may change. And as Cloud Connect evolves and new science projects emerge, we’ll deploy new services that we haven’t even imagined yet. Automation and orchestration will be key for those activities on the horizon, as well.

Join Us to Learn More

Want to learn more about Internet2’s plans for software-driven automation and orchestration? Join us for the virtual TechEXtra 2021: Infrastructure & Advanced Networking event, December 1-3, 2021. View the schedule and register now to get the latest on our Next Generation Infrastructure, community project lessons learned, networking for cloud access, and research and education infrastructure updates.