On Monday morning this week, Internet access in many areas across the U.S. was severely compromised, and several carriers reported routing problems. Remediation was fast (about 90 minutes), and a simple router misconfiguration in the Border Gateway Protocol (BGP) setup at one of the carriers appears to be the culprit. This time, the cause appears to be simple human error.
Down Detector’s user-generated outage map shows the scope of the problem:
The particular error, as noted by several sources, appears to be a route leak between ISPs. Each ISP operates one or more Autonomous Systems (AS) and uses BGP to exchange routes with other ASes and discover the best paths through other service provider networks. In this case, a misconfigured router caused one ISP to send inefficient or bad routing path information to others, causing packets to be delayed or dropped completely.
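To make the idea of a route leak concrete, here is a minimal sketch of the export rule that ISPs commonly try to enforce between neighbors (often called the "valley-free" rule, and close to the route leak classification in RFC 7908). The relationship labels and the `is_route_leak` function are hypothetical names chosen for illustration, and nothing here is meant to describe what actually went wrong in Monday's incident.

```python
# Hypothetical relationship labels, seen from the local AS's point of view.
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"


def is_route_leak(learned_from: str, advertised_to: str) -> bool:
    """Return True if re-advertising a route breaks the usual export rule.

    Assumed rule (a simplification of real-world policy):
      - routes learned from a customer may be exported to any neighbor;
      - routes learned from a peer or a provider should only be
        exported to customers.
    """
    if learned_from == CUSTOMER:
        return False
    return advertised_to != CUSTOMER


# A route learned from one provider and passed on to another provider
# turns the local AS into unintended transit, which is the classic leak.
print(is_route_leak(PROVIDER, PROVIDER))  # leak
print(is_route_leak(CUSTOMER, PROVIDER))  # normal export
```

Real policies have many exceptions, so a check like this is only a starting point for reasoning about which advertisements deserve a closer look.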
As Wired further reported:
Route leaks can be malicious, sometimes called "route hijacks" or "BGP hijacks," but Monday's incident seems to have been caused by a simple mistake that ballooned to have national impact. Large outages caused by accidental route leaks have cropped up before...
Internet outages of all sizes caused by route leaks have occurred occasionally, but consistently, for decades. ISPs attempt to minimize them using "route filters" that check the IP routes their peers and customers intend to use to send and receive packets, and attempt to catch any problematic plans. But these filters are difficult to maintain on the scale of the modern internet, and can have their own mistakes.
Monday's outages reinforce how precarious connectivity really is, and how certain aspects of the internet's architecture—offering flexibility and ease-of-use—can introduce instability into what has become a vital service.
The key takeaways from this event are:
- Basic network connectivity can hardly be taken for granted in a world of increasingly complex global networks. The infrastructure is more brittle than we often assume.
- Managing the infrastructure of any large network can be an overwhelming task when a single manual error on one device can bring down large sections of the system.
- Any small error that results in a large-scale outage can bring business to a halt, impacting revenue and increasing exposure to business liability.
The final takeaway is that this is exactly the kind of misconfiguration error that Forward Enterprise is designed to detect and report, well before it leads to this kind of performance degradation and business impact. Our intent-based verification system can find the needle in the haystack, pinpointing potential errors and how they can be remediated, even across very large telco-scale networks.
As Wired noted, the state of the art in detecting route leaks has traditionally relied on route filters that are difficult to maintain on the scale of the Internet. By contrast, Forward Networks proactively identifies misconfiguration errors like this ahead of time because we compare the network intent (policy requirements such as "network A sends to network C through B") to the actual BGP router configurations and can quickly alert on any discrepancy. Forward maintains a mathematical model of all device configurations and routing paths, and would expose where route leaks exist and which devices need to be addressed.
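As a rough illustration of what comparing intent to actual state can look like, the sketch below checks the example policy "network A sends to network C through B" against a toy next-hop table. It is not Forward Enterprise's implementation or API; the `trace_path` and `violates_intent` functions, the table layout, and the network names are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Toy model: next_hop[node][destination] -> neighbor that traffic is sent to.
NextHopTable = Dict[str, Dict[str, str]]


def trace_path(next_hop: NextHopTable, src: str, dst: str) -> Optional[List[str]]:
    """Follow next hops from src toward dst.

    Returns the list of nodes visited, or None if the path hits a
    dead end or a forwarding loop.
    """
    path = [src]
    current = src
    while current != dst:
        nxt = next_hop.get(current, {}).get(dst)
        if nxt is None or nxt in path:
            return None
        path.append(nxt)
        current = nxt
    return path


def violates_intent(next_hop: NextHopTable, src: str, dst: str, must_transit: str) -> bool:
    """True if traffic from src to dst fails to arrive or skips must_transit."""
    path = trace_path(next_hop, src, dst)
    return path is None or must_transit not in path


# Intended state: A reaches C through B.
intended = {"A": {"C": "B"}, "B": {"C": "C"}}
# Hypothetical bad state: A now prefers a route through X instead.
leaked = {"A": {"C": "X"}, "X": {"C": "C"}}

print(violates_intent(intended, "A", "C", must_transit="B"))  # policy holds
print(violates_intent(leaked, "A", "C", must_transit="B"))    # policy violated
```

The point of the sketch is the shape of the check: the policy is stated once, and every candidate configuration is evaluated against it before it can affect traffic.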
It’s a revolutionary approach and platform for addressing a myriad of tedious manual network configuration errors and policy violations. Interested in learning more? Get a quick demo of how we can analyze your network device configurations and network state, and head off disruptions to your business.