Sharp network admins already verify the network in a variety of ways, right? Pings, traceroutes, and custom scripts verify expected connectivity. Link and CPU utilization monitoring programs verify normal operation. Maybe pushed configs are read back in to verify that the device accepted them. And isn’t verification just another term for testing, anyway?
No. Just like data ain’t knowledge, testing ain’t verification!
In this blog post, you’ll see how stories from other technology domains (like semiconductors and software) point toward verification as a natural next step in any critical environment, especially one like networking.
Manufacturing the next great computer chip begins with producing a mask set – a stack of extremely detailed circuit patterns – which is then used to etch circuits onto a silicon wafer. If a mistake slips into the circuit design and is discovered in the lab, a new mask set is required. This is a big deal! A single bad mask set can cost millions of dollars and delay a product’s introduction.
But even worse is when a bug escapes the lab and customers find it. Remember the Intel Pentium FDIV bug? A rare but easy-to-replicate math error would corrupt data. Recalling and replacing the affected processors cost Intel nearly $500M.
It’s not as if Intel didn’t test their chip in advance. Skilled engineers wrote unit tests to confirm the expected behavior of their own circuits, created integration tests to confirm the behavior of entire modules, and ran extensive test patterns in the lab. But it’s extremely hard to think of every test worth running.
What if – instead of having more people write more tests and yet still have to cross their fingers – we could write a program that would know whether the chip logic was correct, to avoid any errors like FDIV corruption?
Such programs exist under the category of verification. One particular verification technique, called model checking, could have helped avoid the FDIV error. It works by representing circuit states in a graph and checking all possible transitions; the trick is to do this fast enough and on large enough circuits. In fact, this development was so important that it led to a Turing award for its creators.
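To give a flavor of the idea – a toy illustration, nothing like how production model checkers are engineered – an explicit-state checker can be sketched in a few lines: enumerate every reachable state and confirm a safety property holds in each one.

```python
from collections import deque

def model_check(initial, transitions, safe):
    """Explore every reachable state breadth-first and confirm the
    safety property `safe` holds in each; if it doesn't, return the
    offending state as a counterexample."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not safe(state):
            return False, state              # counterexample found
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None                        # property holds everywhere

# A 3-bit counter circuit that wraps at 8; the safety property
# "value < 8" holds in every reachable state.
ok, counterexample = model_check(0, lambda s: [(s + 1) % 8], lambda s: s < 8)
```

Real circuits have state spaces far too large to enumerate directly; much of the Turing-award-winning work was about taming that explosion, for example with symbolic state representations.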
Verification techniques can’t prevent chip fabrication errors and won’t design the chip for you, but they provide extremely valuable assurance that you can’t get any other way.
In the software world, correctness is just as important, especially when a bug has real-world consequences. Think software that operates an airport baggage system, controls a radiation dosage, or guides a rocket into space – like on June 4, 1996, when the Ariane 5 rocket made its maiden voyage.
Thirty-seven seconds into the flight, the rocket exploded. Why? To determine the rocket’s position, an inertial measurement unit adds up sensor values over time. When this code path experienced an integer overflow – and had no way to recover from it – the guidance system failed and the rocket veered off course.
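The failure mode is easy to reproduce in miniature. The sketch below is illustrative only – Python standing in for the original Ada, with a hand-rolled 16-bit conversion – and shows how a value that fits comfortably on one flight profile wraps into garbage on a steeper one:

```python
def to_int16(x: float) -> int:
    """Convert to a 16-bit signed integer the way fixed-width
    hardware would: out-of-range values silently wrap around."""
    return ((int(x) + 32768) % 65536) - 32768

# A velocity-like value that fits in 16 bits converts faithfully...
assert to_int16(20000.0) == 20000
# ...but a larger value wraps into a nonsense negative number
# instead of raising a recoverable error.
assert to_int16(40000.0) == -25536
```

Testing with Ariane 4-style inputs would never exercise the second case; only the new trajectory’s larger sensor values trigger it.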
The ESA extensively tested their software in advance and even re-used known-good guidance code from the Ariane 4. But this rocket followed a slightly different flight trajectory, and the engineers had missed testing with sensor inputs that would have matched the rocket’s actual flight profile.
“Testing shows the presence, not the absence of bugs” – Edsger Dijkstra
The standard approach to testing software is familiar – test code by itself, test it with other pieces, and then test with realistic inputs to all the pieces bundled together. Beyond this, many developers employ coverage tools to measure the completeness of their testing, confirming that every line of code is actually exercised. But in all likelihood, even full line coverage would not have revealed the Ariane 5 fault: coverage measures which lines ran, not which input sequences were tried.
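A tiny, hypothetical example shows the gap. The function below reaches 100% line coverage from a single test, yet still fails on an input that test never tried:

```python
def average_rate(total: int, seconds: int) -> float:
    """Bytes per second over a measurement interval."""
    rate = total / seconds       # crashes when seconds == 0
    return round(rate, 2)

# This one test executes every line -- 100% line coverage...
assert average_rate(10, 4) == 2.5
# ...yet average_rate(10, 0) still raises ZeroDivisionError.
```

The coverage report is green; the bug is still there. Covering lines is not the same as covering inputs.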
What if a program could trace through every execution path in a program, for every possible input, to confirm the absence of this bug?
Software verification tools exist, and Ariane 5 became a poster child for them. One class of tools, static analyzers, verifies that entire classes of problems are absent from a piece of software, including data leaks, race conditions, infinite loops, and – relevant to more than just rocket builders – unhandled overflows.
A single network testing error can have real-world consequences, too. Remember when a network configuration change brought down Amazon EBS and much of the Internet? Just look through the news for examples of planes, trains, and stock-trading terminals brought to a standstill by network issues. Why aren’t we doing more to keep companies off the front page?
Network testing today is partial. We run pings, traceroutes, and maybe custom monitoring tools, but do these tests accurately reflect all the real paths used by all the applications? You don’t know, and you can’t even measure how much you’re missing.
Indirect Evidence vs Direct Proof
Network testing today is indirect. We check utilizations and counters against thresholds, but do those signals reflect a real problem? Over-threshold links, ping failures, and non-zero drop counters can trigger during normal transient behavior, while under-threshold links may actually be unavailable to application traffic. It’s important not just to flag that the network may be broken, but to point out where and why, so you can actually fix it.
Single-Element View vs Whole-Network View
Another common tactic is device-level conformance checking. An auditing tool gives a big thumbs up to a configuration change to shift traffic, because it matches the master template. It’s better than nothing, but it doesn’t answer your question: will the network behave, end-to-end, as it should?
At this point, your spider-sense should be tingling. When other industries experienced critical, front-page errors, it spurred the development of new ways to address the limitations of human-driven testing. We’re now in a world where every business is becoming a software business and the network is a crucial enabler.
Other industries acted. We can act too. We’ll focus on one specific technology.
Network Data-plane Verification provides a way to test that the intent of the network matches its implementation, as gleaned from forwarding state. For example, you can verify that all hosts in a subnet can reach each other on any port, despite being spread across multiple data centers. You can confirm general properties, like freedom from loops. You can even use this to catch future-outage-causing issues, like both sides of a link having different VLANs defined. Unlike today’s testing, it is comprehensive, it is direct, and it analyzes end-to-end behavior.
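A drastically simplified sketch conveys the core move – reasoning over the forwarding state itself rather than probing with test packets. The device names and single-next-hop forwarding model below are hypothetical; real data planes involve longest-prefix matching, ACLs, NAT, and much more.

```python
# Each (hypothetical) device maps a destination prefix to a next hop;
# "deliver" marks a locally attached subnet.
fib = {
    "leaf1": {"10.0.2.0/24": "spine"},
    "spine": {"10.0.2.0/24": "leaf2"},
    "leaf2": {"10.0.2.0/24": "deliver"},
}

def trace(src, prefix):
    """Follow forwarding state hop by hop, flagging loops and
    black holes instead of sending a single test packet."""
    path, device = [], src
    while device != "deliver":
        if device in path:
            return path + [device], "loop"
        path.append(device)
        nxt = fib.get(device, {}).get(prefix)
        if nxt is None:
            return path, "black hole"
        device = nxt
    return path, "delivered"

path, verdict = trace("leaf1", "10.0.2.0/24")   # verdict == "delivered"
```

Because the analysis walks the forwarding tables directly, it can render a verdict for every source and destination, not just the ones a ping happened to exercise.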
Why did it take so long to get here, when other industries had verification tools 20 years ago? Device behavior is ambiguous and inconsistently documented, the equipment landscape is diverse, and real networks are enormous.
We know the answer – all of the above, and more! – because we’ve lived these challenges for the last few years to bring the Forward Platform to market. Ambiguity and diversity are addressed by careful testing, and now we support a majority of the world’s networking equipment. For scaling, like in other verification contexts, the key is to build a model that captures important behavior details, but abstracts away the rest; we represent packets with wildcards, using the Header Space Analysis framework. A range of practical challenges pop up too; I highly recommend A Few Billion Lines of Code Later, which describes these for static program analysis.
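To give a flavor of the wildcard idea – a drastic simplification of the actual Header Space Analysis framework – a single pattern with don’t-care bits stands in for an entire set of concrete packet headers, letting the model reason about many packets at once:

```python
def wildcard_match(header: str, pattern: str) -> bool:
    """Each pattern position is '0', '1', or 'x' (don't care).
    One wildcard pattern represents a whole set of concrete
    headers, which is what lets the analysis scale."""
    return len(header) == len(pattern) and all(
        p in ("x", h) for h, p in zip(header, pattern)
    )

# The pattern "10xx" covers four concrete 4-bit headers at once.
assert wildcard_match("1010", "10xx")
assert not wildcard_match("0110", "10xx")
```

Instead of simulating 2^n individual packets through the network, the analysis pushes a handful of wildcard expressions through each device’s forwarding rules.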
At Forward Networks, verification is at the core of what we do and how we think about networking.
Traditional testing will always have its value, and verification is not a panacea for all network issues. But it offers something uniquely valuable: confidence that your network actually works as intended.
I hope this blog gave you a glimpse of that – and introduced some fascinating case studies in engineering failure along the way.
Many thanks to those who contributed to this post, including Andi Voellmy, Nikhil Handigol, and Siva Radhakrishnan.