BLOG | Nov 13, 2024

Scaling: How We Process 10^30 Network Traffic Flows

Yasser Ganjisaffar
Yasser Ganjisaffar is the SVP of Engineering at Forward Networks. Yasser holds a Ph.D. in Computer Science from UC Irvine.

Who should read this post?

Network engineers and IT professionals managing large-scale, complex networks who seek scalable solutions.
Decision-makers at enterprises utilizing hybrid, multi-cloud networks who want to improve security, agility, and reliability.
Potential employees interested in working with a company at the forefront of network modeling and scaling technology.

What is covered in this content?

How Forward Networks’ digital twin is uniquely able to accurately model complex network environments, supporting devices across multiple vendors and cloud platforms.
How Forward Networks alone can meet the technical challenges of processing vast numbers of network traffic flows and the advanced engineering efforts that allow these flows to be computed quickly and efficiently with low resource requirements.

What does Forward Networks do?

Forward Networks ensures that the world's most complex and mission-critical networks are secure, agile, and reliable. A mathematical model of the network, including computations of all possible traffic paths, is built by collecting configuration data and L2-L7 states from networking devices and public cloud platforms. With support for major cloud providers, including AWS, Microsoft Azure, and Google Cloud Platform, Forward Enterprise stands out as the go-to solution for large enterprises managing hybrid cloud networks with multiple vendors.

What makes computing network traffic flows so difficult?

Large enterprise networks contain thousands of devices (switches, routers, firewalls, load balancers, etc.). Each of these devices has complex behaviors. Picture a large graph with thousands of nodes, where each node represents one of these devices and the links between nodes show how the devices are connected. Network traffic originating from edge devices needs to be precisely modeled.
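To make the graph picture concrete, here is a minimal sketch in Python. The device names and the `LINKS` topology are made up for illustration, and the search only follows physical connectivity. It is not how our product computes paths, because a real model also has to apply each device's forwarding behavior at every hop.

```python
from collections import deque

# Hypothetical topology: device name -> directly connected neighbors.
LINKS = {
    "edge-1": ["core-1"],
    "core-1": ["edge-1", "fw-1", "core-2"],
    "core-2": ["core-1", "edge-2"],
    "fw-1": ["core-1", "edge-2"],
    "edge-2": ["core-2", "fw-1"],
}


def shortest_path(links, src, dst):
    """Breadth-first search returning one shortest device path, or None."""
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for neighbor in links.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # dst is not reachable from src


if __name__ == "__main__":
    print(shortest_path(LINKS, "edge-1", "edge-2"))
```

Even in this toy version, the number of candidate paths grows quickly with the number of nodes and links, which hints at why precise modeling at enterprise scale is hard.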

To do so, you need to understand the exact behavior of each device in handling different packets. A typical enterprise network includes several different types of devices (routers, firewalls, etc.) from multiple vendors (Cisco, Arista, Juniper, etc.), with many firmware versions for each. To build a mathematically accurate model, you need to model every corner case, and many of these are not even documented by vendors.

We have developed an automated testing infrastructure based on a mathematical model to predict forwarding behavior. We purchase or lease these devices, put them in our lab, inject various types of traffic into them, and observe how they behave.

How do we process networks with over 50,000 devices?

Let me explain how we can process networks with over 50,000 devices on a single box or cloud instance. Here is a screenshot of an example network with about 50k devices:

Our customers send us obfuscated data that helps us identify and resolve performance bottlenecks. To obfuscate the data, every IP and MAC address is randomly changed to a different address and every name is converted to a random name. These mappings are irreversible. Obfuscating data does not materially change the model's behavior because obfuscated data is still representative of the original network’s complexity and diverse network behaviors. Sharing this data is a win-win scenario: our digital twin gets better over time, and customers get even faster processing times. The network in the above example is built from such data.
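For readers curious what this kind of obfuscation can look like, below is a minimal sketch in Python. The `Obfuscator` class is hypothetical and is not our actual tool. It assumes IPv4 only and replaces each address independently, so it does not preserve subnet relationships, which a production obfuscator would need to do to keep routing behavior representative.

```python
import ipaddress
import secrets


class Obfuscator:
    """Hypothetical sketch: map each distinct IPv4 address to a random one."""

    def __init__(self):
        # original -> obfuscated. Discarding this table after the run is
        # what makes the mapping irreversible in this sketch.
        self._mapping = {}
        self._used = set()

    def ip(self, original: str) -> str:
        """Return the obfuscated address, stable for repeated inputs."""
        if original not in self._mapping:
            while True:
                candidate = str(ipaddress.IPv4Address(secrets.randbits(32)))
                # Must differ from the input and from earlier outputs.
                if candidate != original and candidate not in self._used:
                    break
            self._used.add(candidate)
            self._mapping[original] = candidate
        return self._mapping[original]
```

The same input always maps to the same output within one run, so references between devices stay consistent even though the real addresses are gone.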

This network includes more than 10³⁰ flows. Each flow shows how a group of similar packets traverses the network. For example, one flow might show how email traffic originating from a specific host and destined to another host starts from a datacenter and then goes through several backbone devices before arriving at the destination data center.

Each of these flows can be complex. If we were to spend 1 microsecond computing each of these flows, the whole computation would take more than 10¹⁶ years. After years of advanced engineering work, algorithmic optimizations, and performance optimizations, we are able to process this network in under an hour on a single box. In the majority of cases, the computation scales linearly. For customers who need faster processing speed or higher verification throughput, we offer a cluster version, which can be scaled up or down as needed.
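The brute-force estimate is easy to check with a few lines of Python. The only assumptions here are the round figure of 10³⁰ flows, the hypothetical 1 microsecond per flow, and a 365.25-day year.

```python
FLOWS = 10**30                          # round figure for the flow count
SECONDS_PER_FLOW = 1e-6                 # hypothetical: 1 microsecond per flow
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 seconds

total_seconds = FLOWS * SECONDS_PER_FLOW          # 1e24 seconds
total_years = total_seconds / SECONDS_PER_YEAR    # roughly 3.2e16 years

print(f"{total_years:.2e} years")
```

That is millions of times the age of the universe, so enumerating flows one at a time is not an option no matter how much hardware is available.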

What lessons have we learned?

Our first data source was a very small network. We became better, faster, and more scalable as we optimized our software, allowing us to reach out to customers with larger networks and find the next bottlenecks. As we gained access to larger data sets, we saw new patterns we hadn't anticipated, which helped us improve the computation core of our software multiple times. Using obfuscated data enables us to model some of the most complex and regulated networks in the world without compromising customer security, and it proves that we can support any private or government network.

What minimal hardware requirements does Forward Networks’ on-prem software work with?

Today, provisioning an instance in AWS, Azure, or other cloud providers with 1 TB or more of RAM is easy. Yet you might be surprised to know how long it takes some customers to provision a single on-prem instance with a modest amount of memory. To ensure customers quickly experience the value of our software, we streamline the initial setup process, so you can start using it right away during the proof-of-concept period. This means you won't have to wait for lengthy provisioning times or deal with low-priority ticket delays, allowing you to evaluate the software's benefits in your environment efficiently and without hassle.

Forward Networks has learned to be very careful when adding new tools, frameworks, or dependencies. Because our resource requirements are so low, our developers are able to run the entire stack on their laptops, which is essential for fast debugging and rapid iteration.

We have spent a lot of engineering time and effort on making this possible. Here are some of our high-level approaches:

Avoiding repeated computation.
Deduping data structures in memory and on disk.
Lazy computation: delaying processing to when it is actually needed.
Making core data structures as compact as possible and with very low serialization and access overheads (our in-house serialization implementation has ideas similar to Google’s FlatBuffers).
Using fastutil for fast and memory-efficient collections in Java. Our fork even improved its performance and added support for immutable structures.
Detecting and optimizing actual bottlenecks through performance profiling.
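Three of these ideas (avoiding repeated computation, deduping in memory, and lazy computation) can be illustrated with a small Python sketch. Our production code is written in Java and looks nothing like this. The `Device` class, `intern_value`, and `expensive_lookup` are hypothetical names used only for illustration.

```python
from functools import cached_property, lru_cache

_interned = {}


def intern_value(value):
    """Return one shared instance for equal immutable values (deduping)."""
    return _interned.setdefault(value, value)


@lru_cache(maxsize=None)
def expensive_lookup(rule_set):
    """Stand-in for a costly step; identical inputs are computed only once."""
    return sum(len(rule) for rule in rule_set)


class Device:
    def __init__(self, name, rules):
        self.name = name
        # Devices with identical rule tuples share a single object in memory.
        self.rules = intern_value(tuple(rules))

    @cached_property
    def rule_cost(self):
        # Lazy: evaluated on first access only, then stored on the instance.
        return expensive_lookup(self.rules)
```

If a thousand devices carry the same rule set, this sketch stores the rules once, computes the derived value once, and does neither until something actually asks for it.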

You can't simply use a cluster with 1,000 nodes when you need to scale to 1,000x or 10,000x. It is not economically justified even if it is possible. To get the same result with minimal resources, you need to do the hard engineering work. Most of our customers run our network digital twin on a single computer. But we also offer the cluster version for customers who want to ensure high availability, or who have more concurrent users and want higher search or compute throughput. We support deployments in customer-owned AWS or Azure cloud environments for those who wish to use their own cloud.

One of our customers told us they were amazed by what our software delivers with such low requirements, after having had to provision and maintain a few racks of servers for a comparable software solution (in the same space as us but not exactly our competitor). This validated our efforts. By focusing on low-compute mechanisms to solve the problem, we’ve enabled our customers to accelerate deployment while saving money.

Are open source tools always the answer?

In the early years of our startup, we relied on off-the-shelf platforms and tools. Over time, it became clear that while these platforms were generic enough to apply to a wide range of applications, they were not appropriate for our product because they didn’t support the level of customization we required.

For example, initially we were computing all end-to-end network behaviors, then indexing and storing them in a generic platform. Eventually, it became evident that precomputing all such behaviors was not feasible. Even if it were possible, such an index would be enormous in size. We switched to a lazy computation approach where we pre-compute just enough data to perform quick searches, and at search time we do the rest of the computation that is specific to the user's query.
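The split between "pre-compute a little" and "compute the rest at search time" can be sketched in Python as follows. This is a toy under stated assumptions, not our search platform: `LazyPathSearch` and the `next_hop` table format (device -> destination -> next device) are hypothetical.

```python
class LazyPathSearch:
    """Toy example: small up-front index, full path computed per query."""

    def __init__(self, next_hop):
        # Hypothetical input: {device: {destination: next device}}.
        self.next_hop = next_hop
        # Pre-computed once: destination -> devices that have a route to it.
        self.index = {}
        for device, table in next_hop.items():
            for dest in table:
                self.index.setdefault(dest, set()).add(device)

    def search(self, src, dest):
        """Return the hop-by-hop path for one query, or None."""
        # Cheap index lookup rules out queries with no route at all.
        if src not in self.index.get(dest, ()):
            return None
        # The full path is computed only now, for this specific query.
        path, seen = [src], {src}
        while path[-1] != dest:
            nxt = self.next_hop.get(path[-1], {}).get(dest)
            if nxt is None or nxt in seen:  # dead end or forwarding loop
                return None
            path.append(nxt)
            seen.add(nxt)
        return path


if __name__ == "__main__":
    tables = {
        "edge-1": {"edge-2": "core-1"},
        "core-1": {"edge-2": "edge-2"},
        "edge-2": {},
    }
    print(LazyPathSearch(tables).search("edge-1", "edge-2"))
```

The index stays small because it never stores full paths. The cost of tracing a path is paid only for the queries users actually run.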

Because of the limitations of these generic platforms, we developed our own distributed compute and search platform. This in-house development is the foundation of our ability to scale.

What’s next?

While we believe we have already built a product that is a significant step forward in how networks are managed and operated, our journey is 1% complete. Our vision is to become the essential platform for the whole network experience, and we have just started in that direction. If this is something that interests you, please join us. We are hiring for key positions across several departments. Note that having prior networking experience is not a requirement for most of our software engineering positions.

If you operate a large-scale complex network, please request a demo to see how our software can de-risk your network operations and return massive business value.
