April 27, 2017

Seeking Truth in Networking: From Testing To Verification

by Brandon Heller

Sharp network admins already verify the network in a variety of ways, right? Pings, traceroutes, and custom scripts verify expected connectivity. Link and CPU utilization monitoring programs verify normal operation. Maybe pushed configs are read back in to verify that the device accepted them. And isn’t verification just another term for testing, anyway?

No. Just like data ain’t knowledge, testing ain’t verification!

In this blog post, you’ll see how stories from other technology domains (like semiconductors and software) point toward verification as a natural next step in any critical environment, especially one like networking.

Learning from history: verification in other domains

Computer Hardware

Manufacturing the next great computer chip begins with producing a mask set – a stack of extremely detailed circuit patterns – which is then used to etch circuits onto a silicon wafer. If a mistake slips into the circuit design and is discovered in the lab, a new mask set is required. This is a big deal! A single bad mask set can cost millions of dollars and delay a product’s introduction.

But even worse is when a bug escapes the lab and customers find it. Remember the Intel Pentium FDIV bug? A rare but easy-to-replicate math error would corrupt data. Recalling and replacing the affected processors cost Intel nearly $500M.

It’s not as if Intel didn’t test their chip in advance. Skilled engineers wrote unit tests to confirm the expected behavior of their own circuits, created integration tests to confirm the behavior of entire modules, and ran extensive test patterns in the lab. But it’s extremely hard to think of every test worth running.

What if – instead of having more people write more tests and yet still have to cross their fingers – we could write a program that would know whether the chip logic was correct, to avoid any errors like FDIV corruption?

Such programs exist under the category of verification. One particular verification technique, called model checking, could have helped avoid the FDIV error. It works by representing circuit states in a graph and checking all possible transitions; the trick is to do this fast enough and on large enough circuits. In fact, this development was so important that it led to a Turing award for its creators.

Verification techniques can’t prevent chip fabrication errors and won’t design the chip for you, but they provide extremely valuable assurance that you can’t get any other way.

Computer Software

In the software world, correctness is just as important, especially when a bug has real real-world effects. Think software that operates an airport baggage systemcontrols a radiation dosage, or guides a rocket into space, like on June 4, 1996, when the Ariane 5 rocket made its maiden voyage to space.

Thirty-seven seconds into the flight, the rocket exploded.  Why?  To determine the rocket’s position, an inertial measurement unit adds up sensor values over time.  When this code path experienced an integer overflow – but had no way to recover from it – the rocket lost control.

The ESA extensively tested their software in advance and even re-used known-good guidance code from the Ariane 4.  But this rocket followed a slightly different flight trajectory, and the engineers had missed testing with sensor inputs that would have matched the rocket’s actual flight profile.

“Testing shows the presence, not the absence of bugs”

The standard approach to testing software is familiar – test code by itself, test it with other pieces, and then test with realistic inputs to all the pieces bundled together.  Beyond this, many developers employ coverage tools to measure the completeness of their testing, to confirm that all lines of code are actually tested. But in all likelihood, even coverage testing would not have revealed this software fault; even systematic full-line-coverage testing can miss key input sequences.

What if a program could trace through every execution path in a program, for every possible input, to confirm the absence of this bug?

Software verification programs exist, and Ariane 5 became a poster child for them.  One example, static analyzers, verify that entire classes of problems are absent from a piece of software, including data leaks, race conditions, infinite loops, and – relevant to not just rocket builders – unhandled overflows.

Network Testing Today

A single network testing error can have real-world consequences, too.  Remember when a network configuration change brought down Amazon EBS and much of the Internet?  Just look through the news for examples of planes, trains, and stock-trading terminals brought to a standstill by network issues.  Why are we not doing enough to keep companies off the front page?

Limitations of today’s testing and how to move past them

Partial Coverage vs Complete Coverage

Network testing today is partial.  We run pings, traceroutes, and maybe custom monitoring tools, but do these tests accurately reflect all the real paths used by all the applications?  You don’t know and you can’t even measure how much you’re missing.

Indirect Evidence vs Direct Proof

Network testing today is indirect.  We check utilizations and counters against thresholds, but is a real problem present?  Over-threshold links, ping failures, and non-zero drop counters can trigger for normal transient behavior, while under-threshold links may actually be unavailable for application traffic.  It’s important to not just identify that the network may be broken, but point out where and why so you can actually fix it.

Single-element view vs Whole-Network View

Another common tactic is device-level conformance checking.  An auditing tool gives a big thumbs up to a configuration change to shift traffic, because it matches the master template. It’s better than nothing, but it doesn’t answer your question: will the network behave, end-to-end, as it should?

The Aha Moment: Verifying Networks

At this point, your spider-sense should be tingling. When other industries experienced critical, front-page errors, it spurred the development of new ways to address the limitations of human-driven testing. We’re now in a world where every business is becoming a software business and the network is a crucial enabler.

Other industries acted. We can act too. We’ll focus on one specific technology.

Network Data-plane Verification provides a way to test that the intent of the network matches its implementation, as gleaned from forwarding state. For example, you can verify that all hosts in a subnet can reach each other on any port, despite being spread across multiple data centers.  You can confirm general properties, like freedom from loops.  You can even use this to catch future-outage-causing issues, like both sides of a link having different VLANs defined. Unlike today’s testing, it is comprehensive, it is direct, and it analyzes end-to-end behavior.

Why did it take so long to get here, when other industries had verification tools 20 years ago?

  • One possible answer: Systems that aren’t clearly defined. RFCs and vendor manuals are not formal, unambiguous specifications; interactions between protocols are frequently undefined.
  • A second possible answer: Overwhelming diversity. Who has the time to model the multi-dimensional matrix of protocols, OSes, devices, and versions in every modern network?
  • A third possible answer: Modern networks are huge. Any analysis must scale to an enormous number of potential paths and packets.

We know the answer – all of the above, and more! – because we’ve lived these challenges for the last few years to bring the Forward Platform to market.  Ambiguity and diversity are addressed by careful testing, and now we support a majority of the world’s networking equipment. For scaling, like in other verification contexts, the key is to build a model that captures important behavior details, but abstracts away the rest; we represent packets with wildcards, using the Header Space Analysis framework.  A range of practical challenges pop up too; I highly recommend A Few Billion Lines of Code Later, which describes these for static program analysis.

At Forward Networks, verification is at the core of what we do and how we think about networking.

Traditional testing will always have its value, and verification is not a panacea for all network issues.  But it offers something uniquely valuable: confidence that your network actually works as intended.  

I hope this blog gave you a glimpse of that, and introduced some fascinating studies in engineering failures in the process.

Many thanks to those who contributed to this post, including Andi Voellmy, Nikhil Handigol, and Siva Radhakrishnan.

Learn the modern way that companies stay off the front page

Experience a demo of the Forward Platform

Subscribe to our blog!

September 13, 2023
A Financial Services Company Saved “7 Figures” By Improving Network Inventory Management

Everyone knows inventory management is important – but so are the 100+ other things we need to do, and let’s face it, the inventory is not on fire. Given the benefits one customer experienced, maybe it should be.   On September 14, at 2:00pm Eastern time, we’re hosting a webinar, featuring special guests, Michael Wynston, Director of Network Architecture and […]

Read More
September 7, 2023
What’s worse than a toothache?

For me, I’d have to say it’s sitting through a high-pressure demo with a sales guy who needs to close business. Given the choice, I’ll take the dentist office visit anytime, at least they give you meds! We realize that sales demos aren’t always pleasant. And while we strive to create a comfortable environment for […]

Read More
August 23, 2023
How do you Monitor and Manage a Network Without Borders?

There are only two options for managing a global multi-cloud network: either by using a combination of inference, hope, and intuition or with mathematical certainty. When conducting 5 million financial transactions daily, it’s essential to operate with certainty, regardless of your network’s size or geographical distribution. Auditors don’t accept inferences; they demand certainty when determining […]

Read More
crossmenu linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram