
In the world of networking, misconfigurations and inconsistencies can lead to significant issues for businesses, especially those in highly regulated industries such as financial services. One Fortune 500 financial services company experienced a nightmare scenario with their MTU (Maximum Transmission Unit) settings, resulting in application and performance problems. Fortunately, the company found a solution in Forward Networks' digital twin. This blog post will delve into the MTU issues faced by the company and how Forward Networks helped them overcome these challenges.

The MTU Nightmare:

The financial industry is known for its tight control over deployments and configurations. However, this also means that any misconfiguration or inconsistency can have severe consequences. The financial services company had been facing application issues and performance problems that had been occurring randomly, making it difficult to identify the root cause. Eventually, it was discovered that there was a misconfiguration in the MTU settings of a cross link between core devices. While jumbo frames were enabled north and south, this cross link was set to a lower MTU size of 1500. The issues became more apparent when the primary path failed, and traffic started traversing this misconfigured link. The network devices had to fragment the traffic, resulting in processing delays.
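To make the fragmentation cost concrete, here is a minimal sketch of the IPv4 arithmetic. It assumes a 20-byte IPv4 header with no options and a 9000-byte jumbo packet; the article does not state the exact jumbo size in use, and the function name is purely illustrative.

```python
import math

def ipv4_fragment_count(packet_len: int, link_mtu: int, header_len: int = 20) -> int:
    """Number of IPv4 fragments needed to carry one packet over a link."""
    if packet_len <= link_mtu:
        return 1  # the packet fits, so no fragmentation is needed
    payload = packet_len - header_len
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = ((link_mtu - header_len) // 8) * 8
    return math.ceil(payload / per_fragment)

# Assumed 9000-byte jumbo packet crossing the 1500-byte cross link.
print(ipv4_fragment_count(9000, 1500))  # 7
```

Under these assumptions, every jumbo packet that lands on the 1500-byte link becomes seven packets, each of which the device has to build and forward.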

Before incorporating Forward Networks' digital twin technology, the company had programmers writing custom scripts to identify misconfigurations. These scripts were scattered across personal drives, making it challenging to consolidate and analyze the vast amounts of data they generated. The team became overwhelmed with Excel files containing close to a million lines, making the investigation process nearly impossible.

Recognizing the need for a more efficient and centralized approach to network analysis, the company turned to Forward Networks. The company saw promising results with Forward Networks' digital twin solution, which offered out-of-the-box capabilities to address their MTU issues. Forward Networks provided a pre-built script specifically designed to identify MTU misconfigurations, convincing the company of the platform's suitability for their needs.

Although the pre-written script provided by Forward Networks yielded results, the company still needed to narrow down the information it wanted to see. Despite not being a programmer, one of the company's network engineers was able to customize the script using educational resources provided by Forward Networks and the company’s user community. The engineer successfully created a tailored solution that only displayed infrastructure MTU information, filtering out unnecessary data.
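The kind of filtering described above can be sketched in a few lines of Python. This is not the Forward Networks script, which the article does not show; the `Link` and `LinkEnd` types, the `infrastructure` flag, and the expected MTU of 9000 are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkEnd:
    device: str
    interface: str
    mtu: int

@dataclass(frozen=True)
class Link:
    a: LinkEnd
    b: LinkEnd
    infrastructure: bool  # True for device-to-device links, False for host-facing ports

def find_mtu_issues(links, expected_mtu, infrastructure_only=True):
    """Return (link, reason) pairs for links whose MTU looks misconfigured."""
    issues = []
    for link in links:
        if infrastructure_only and not link.infrastructure:
            continue  # drop host-facing ports from the report
        if link.a.mtu != link.b.mtu:
            issues.append((link, "mismatch between link ends"))
        elif link.a.mtu < expected_mtu:
            issues.append((link, "below expected MTU"))
    return issues

# Hypothetical inventory: one healthy uplink, the cross link, one host port.
links = [
    Link(LinkEnd("core1", "eth1", 9000), LinkEnd("agg1", "eth1", 9000), True),
    Link(LinkEnd("core1", "eth9", 1500), LinkEnd("core2", "eth9", 1500), True),
    Link(LinkEnd("agg1", "eth20", 1500), LinkEnd("server7", "eth0", 1500), False),
]
for link, reason in find_mtu_issues(links, expected_mtu=9000):
    print(link.a.device, link.b.device, reason)
```

Restricting the report to infrastructure links is what keeps the output short enough to act on, which was the point of the engineer's customization.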

By leveraging the capabilities of Forward Networks' network assurance and intent-based networking platform, this company saved a significant amount of time by automating the identification of MTU issues. Forward Networks enabled them to replace manual, fragmented scripts with a centralized solution, empowering their engineers to analyze network data more efficiently. In turn, they have minimized downtime and ensured that their network infrastructure is robust and reliable.

By George Lawton, VentureBeat

The network, once seen as little more than plumbing in the datacenter, is at the center of distributed IT operations. Ensuring network operations and protecting them from cyberattacks has become paramount to modern enterprises.

“You can almost imagine that networks would be on par with power and water and electricity and that kind of stuff,” Nikhil Handigol, co-founder of Forward Networks, tells The Next Platform. “You cannot imagine a modern business functioning without its network functioning. On one hand, networks were super critical for big businesses and are becoming increasingly critical. On the other hand, they’re becoming more and more complex and more and more fragile, both from a connectivity perspective and from a security perspective. From a connectivity perspective, they were so fragile that one misconfiguration could take the entire network down. It’s still the case.”... [READ MORE ON THE NEXT PLATFORM]

By George Lawton, VentureBeat

This is the second of a two-part series. Read part 1 about the current state of networking and how digital twins are being used to help automate the process, and the shortcomings involved.

As noted in part 1, digital twins are starting to play a crucial role in automating the process of bringing digital transformation to networking infrastructure. Today, we explore the future state of digital twins – comparing how they’re being used now with how they can be used once the technology matures.... [READ MORE on VentureBeat]

By George Lawton, VentureBeat

Designing, testing, and provisioning updates to data digital networks depends on numerous manual and error-prone processes. Digital twins are starting to play a crucial role in automating more of this process to help bring digital transformation to network infrastructure. These efforts are already driving automation for campus networks, wide area networks (WANs), and commercial wireless networks... [READ MORE on VentureBeat]

I recently published a piece in Dark Reading covering the network security challenges of M&A activity. As we ease the restrictions put in place to combat COVID-19, we’re expecting to see business activity, including M&A, pick up speed. It’s important that the implications of integrating networks are fully understood so that the expected business benefits are achieved as soon as possible.

Economists from JPMorgan Chase, Goldman Sachs, Morgan Stanley, and more are predicting that the U.S. is about to enter an economic boom, with estimates ranging from 4.5% to 8% expected economic growth. With the economy recovering, Deloitte found that many companies and enterprises expect their M&A activity to return to pre-COVID-19 levels within the next 12 months and are starting to eagerly eye the market. But today’s M&As are more complicated than ever, with the involved organizations needing to account for vital cybersecurity, privacy, and data management practices during this process.

In fact, recent analyst research uncovered that the biggest hurdle to effectively managing the integration phase of a deal in today’s environment is technology integration. 20% of businesses noted effective integration was the most important factor in achieving a successful M&A – and 28% identified execution/integration gaps as the primary reason their M&A transactions didn’t generate expected value. As I mentioned in the Dark Reading article, a company being acquired is also a target for bad actors, as they look for openings and vulnerabilities in smaller companies that can later give them access to the larger enterprise’s network – Deloitte found that the top concern in executing M&A deals for U.S. executives and private-equity investor firms is cybersecurity (51%).

With technology integration being one of the most important and most difficult factors for a successful M&A – how can companies set themselves up for success?  

The secret is to have a full understanding of the IT infrastructure. Unless you know how everything is connected to everything else, you really can’t make any good architecture decisions to change things. And the starting point is always the network. But this is a herculean challenge in and of itself. Every network is uniquely crafted by the company’s distinct needs and the personal approaches of the network engineers involved. Each network with its specific devices, firewalls, and configurations is going to operate and function differently – nothing can be assumed.

To drastically accelerate and de-risk M&A integration, IT needs a detailed understanding of the entire network topology and its behavior. But that understanding is very hard to come by: most network maps and inventories are incomplete or badly out of date, because keeping them current by hand is nearly impossible. Trying to write down a device list, map out the data paths, note all the configurations, figure out the operational processes, and enforce the network-wide security postures would take a full network team months or years depending on the complexity. For businesses that find themselves in this predicament, it is vital that they invest in solutions that can analyze their digital infrastructure to discover existing assets and to map the network.

Depending on the particular pain points, network analysis solutions range from network monitoring and visualization, to intent-based capabilities like network verification and prediction. Network and application dependency mapping tools can inform teams how the various applications and devices act with and rely on one another. Even something as simple as a help desk ticketing system can provide useful data for these ends. 

With a live network map, the companies can then evaluate the infrastructure for cybersecurity compliance and for future integration. Tools like port scanning, network configuration checks, and path verification allow IT to see if the network is operating consistently and is compliant with company policies. IT will especially want to focus on solutions that root out existing liabilities, such as vulnerability assessments, penetration testing, and compliance assessments. For instance, a network digital twin allows enterprises to overlay security policies on other networks – allowing for identification of network compliance issues, flagging outdated configurations, locating forgotten equipment, proactively unveiling security violations, and alerting operators of unpatched vulnerabilities. 

It’s ideal if the chosen solutions can also normalize the network data (present the data in a vendor-agnostic manner), making it much easier for IT to quickly read and understand the various infrastructure devices and configurations. This is particularly helpful for network operations staff who, after the merger, are handling help desk tickets and issues across both networks at the same time. With a normalized dataset, IT can then efficiently merge both companies’ data together to jointly analyze the infrastructure, allowing for a much faster, simpler, and more comprehensive examination of the networks. This is impossible to do without a comprehensive view of existing data, so many enterprises look to data management tools and platforms to help locate and consolidate their critical data.

Connecting and integrating the network infrastructure is the moment of truth for the M&A: businesses need to ensure that everything will continue to operate properly before internal operations can actually be merged. Having a normalized and accurate network map gives the IT team a view of the scope of the two companies’ networks, allowing them to identify conflict areas that need to be worked out before the networks are merged so that there is no risk to production and client services. With the right software, the process can be automated, so it’s faster and more accurate, and intent checks can also make sure that traffic is doing what it should or pinpoint the problem for immediate resolution.
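As one illustration of a conflict area, consider overlapping address space. The article does not name specific conflict types, so this is an assumed example: a minimal sketch using Python's standard `ipaddress` module, with made-up prefixes for the two companies.

```python
import ipaddress

def overlapping_subnets(prefixes_a, prefixes_b):
    """Return pairs of prefixes (one from each network) whose address ranges overlap."""
    nets_a = [ipaddress.ip_network(p) for p in prefixes_a]
    nets_b = [ipaddress.ip_network(p) for p in prefixes_b]
    return [(str(a), str(b)) for a in nets_a for b in nets_b if a.overlaps(b)]

# Hypothetical address plans for the acquiring and the acquired company.
acquirer = ["10.0.0.0/16", "192.168.1.0/24"]
acquired = ["10.0.4.0/24", "172.16.0.0/12"]
print(overlapping_subnets(acquirer, acquired))
```

Any pair this returns would need renumbering or address translation before the two networks could be connected.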

Using this information, IT can identify critical network and application paths that need to be preserved in isolation and potential points where the two companies’ infrastructures can be connected. This has several key security and financial purposes. It allows for a check of whether the network architectures are compliant with one another, and it also lets the companies see where there is excess infrastructure that can be removed. Network path verification tools can also allow IT to preemptively see any potential integration holes by visualizing what the new data paths will be, so the team can address any gaps ahead of time with stop-gap solutions.

When encountering different regulatory hurdles, it’s usually best to make the higher bar the standard across both organizations, which simplifies the security and compliance policies. Services like next-generation endpoint protection, next-generation firewalls, and other solutions that protect data and applications from attack are important for securing the IT environment after a merger.

The risk involved in merging the digital infrastructures of major enterprises is simple to summarize: if you don’t know what it is and how it works, you can’t ensure it will continue to work if changes are made – like integrating it with another network. Even worse, it can aggravate the already existing security flaws or holes that are wrapped into your security paradigm. By integrating new devices and data paths to parts already able to be compromised, IT is increasing vulnerability and risk.

In today’s world of digital transformation, it’s more important than ever that enterprises engaging in M&As both empower and protect themselves by properly approaching network integration and adopting services where needed to support network analysis.

Today’s networks are too complex for manual network management and updates. With most enterprise networks composed of tens of thousands of devices spanning multiple geographical locations, on-premises hardware, virtual environments, and multiple clouds, it’s virtually impossible to push updates manually. The sheer volume of vendors and coding languages can also be overwhelming for a network operations engineer. In most cases, learning a new language or new platform takes eight weeks to achieve basic proficiency; it’s not realistic to expect human skills to scale at the pace of network innovation (aka network complexity).

Fig 1. Itential Automation Example

That is why we decided to integrate the Forward Networks platform with Itential’s industry-leading network automation platform. Itential’s low-code automation platform makes it easy for network operations teams to deploy and manage multi-domain infrastructures. Its cloud-native software-as-a-service offering provides a low-code interface that seamlessly connects to any IT system, cloud, or network technology for end-to-end closed-loop network automation and orchestration. Forward Enterprise enables network operators to deploy automated changes with the assurance that they are in compliance with network policies and won’t have any unintended side effects.

Fig 2. Closing the Loop: Automation + Verification

Forward Enterprise helps network operations engineers avoid outages through its unique mathematical model. The platform creates a digital twin of the network (across on-premises devices, private and public cloud) enabling network operators to map all possible traffic flows, instantly troubleshoot, verify intent, predict network behavior, and reduce MTTR (mean time to resolution). Itential simplifies and accelerates the deployment and management of multi-domain network infrastructure. Both platforms support major network equipment vendors and AWS, Azure, and Google Cloud platforms.

Fig 3. Automate Service Provisioning with Forward Networks and Itential

The Closed Loop Automation process enabled by the integration of the Forward Networks platform and the Itential Automation Platform (IAP) acts as a safeguard to prevent any issues from becoming pervasive following a change window. The pre-built automations, templates, form builder, and automation builder within Automation Studio make it easy for network operations engineers to build an automation catalog that enables changes at scale. By using the API integration with Forward Networks, they can verify routing, add intent checks, verify new service connectivity, check for side effects, and send notifications and verifications via Slack, Microsoft Teams, Cisco Webex, and email. Integration with change management systems including ServiceNow and Jira ensures everyone is working from a single source of truth and expedites collaboration. In the event of an issue, the diff check functionality within the Forward Networks platform makes it easy to pinpoint which changes are causing any unplanned behavior.
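The overall shape of a closed loop (change, verify, notify) can be sketched generically in Python. Nothing here is the Itential or Forward Networks API: every callable is a hypothetical stand-in, and the rollback step is an assumption that the article itself does not describe.

```python
from typing import Callable, Iterable, Tuple

Check = Tuple[str, Callable[[], bool]]

def closed_loop_change(
    apply_change: Callable[[], None],
    checks: Iterable[Check],
    rollback: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Apply a change, verify it, and roll back if any check fails."""
    apply_change()
    failed = [name for name, check in checks if not check()]
    if failed:
        rollback()  # assumption: rollback is not described in the article
        notify("Change rolled back, failed checks: " + ", ".join(failed))
        return False
    notify("Change verified, all checks passed")
    return True

# Usage with stand-in callables; a real setup would call the automation
# and verification platforms here instead.
state = {"vlan_100": False}
messages = []
ok = closed_loop_change(
    apply_change=lambda: state.update(vlan_100=True),
    checks=[("new service reachable", lambda: state["vlan_100"]),
            ("no side effects", lambda: True)],
    rollback=lambda: state.update(vlan_100=False),
    notify=messages.append,
)
print(ok, messages)
```

Passing the checks in as named callables keeps the loop itself independent of whichever verification system answers them.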

For more detail on how the integration works, please view our ONUG Spring 2021 session.

Network operations teams rely on highly specialized tools developed by individual vendors designed to address particular problems. The result? Most enterprises have 10+ network operations applications in place and they don’t talk to each other, which means that network operations engineers spend an excessive and unnecessary amount of time toggling between applications and sifting through information as they work to resolve tickets. Having multiple tools provide state information also introduces inconsistencies in data accuracy and level of detail.

Because information is not portable between applications, is vendor-specific, is siloed behind security boundaries across the network, or is not current, the teams charged with network and security operations are at a disadvantage. When people working to solve a problem have incorrect, incomplete, or out-of-date information, they cannot solve it efficiently.

We don’t think it should be that hard

Forward Networks was created to make the hard parts of network operations easier.  For us, that means giving instant access to the information you need to troubleshoot and resolve network issues. 

The Forward Networks platform is based on a mathematical model that creates a digital twin of the network. This software-based twin provides a comprehensive visualization of all possible network paths, a searchable index of configurations presented in a vendor-neutral manner that even tier-one support specialists can easily understand, and the ability to verify network behavior and predict how NAT or ACL changes will impact the network. Network state information is updated at regular intervals determined by the operations team.

To ease the burden on network operations teams, we’ve developed an integration between Forward Networks and ServiceNow that provides a single source of truth for the network and enables more efficient use of both platforms. The integration allows engineers to automatically share relevant details about network state, configuration, and behavior with everyone working on resolving an issue. This information automatically updates within both platforms, creating a detailed and current single source of truth. The integration between the two applications takes only seconds to enable and configure.

Reduce Mean Time to Resolution (MTTR)

A typical incident response involves several teams: the network operations engineer who got the call, maybe the apps team or security team, and more senior engineers if the case needs to be escalated. The difficulty of resolving issues is compounded when everyone is working from their own assumptions and data. One of the most effective ways to reduce mean time to resolution is by creating an accurate single source of truth and ensuring everyone involved has access to it.

Because Forward Networks regularly verifies that the network is behaving as intended, it can (at the discretion of the network operations team) proactively open and update ServiceNow incidents based on these verification checks. Whether incidents are created automatically or manually, a link to the relevant data becomes part of the incident and is updated as the system collects network state information, which ensures everyone is working from the same information. For existing ServiceNow incidents, the Forward Networks integration allows network engineers to capture relevant information and add it to the incident, again saving the resolution team time they would have spent researching the issue.

This integration also allows network operations to verify that the changes they’ve made have resolved the issue by running a query. The platform will show whether the issue is resolved, or let the engineer see how their change impacted the network and what else may be causing the issue, so tickets can be followed through to resolution. Incident history can be viewed from within Forward Networks or ServiceNow, allowing the engineering team to see all actions and status from their platform of choice.

The real benefit of this integration is immediate access to information that reduces the mean time to resolution from hours to minutes for most problems. 

See the Forward Networks ServiceNow integration in action

Have 5 minutes? Watch the Forward Networks and ServiceNow integration in action on our Forward Fix – engineering content by engineers, for engineers. 

Multicast IP Routing protocols have been increasingly popular over the last several years. 

They are used to efficiently distribute data to a group of destination hosts simultaneously in a one-to-many or many-to-many model.

Typical examples of applications where the use of multicast is very common are audio/video streaming services (e.g. content delivery networks, IPTV) and financial trading platforms (e.g. stock exchanges).

This blog is not intended to be a multicast tutorial, but it aims to showcase how Forward can map and search multicast-enabled networks. Before going there, just a quick recap, at a very high level, of the multicast components:

Putting it all together, a Multicast Client signals that it’s interested in receiving a multicast stream identified by a Multicast IP address using a Group Management Protocol (e.g. IGMPv3).

A Multicast Source sends a multicast stream of data to a Multicast Group and the Multicast Routers forward the multicast streams to the Multicast Clients using a Multicast Routing Protocol (e.g. PIM-SM).

If your background is more on applications, the concept is similar to a Pub/Sub (Publisher/Subscriber) messaging service where the Publishers in multicast are the Multicast Sources and the Subscribers are the Multicast Clients. 

Easy right? 

Well, the concept might be easy, and the basic configuration can be fairly straightforward, but when something doesn’t work, a multicast network is a nightmare to troubleshoot.

This is where Forward Networks can, once again, make your life way easier!

Forward Enterprise allows you to get full visibility and analysis of your multicast network by searching for multicast insights and performing multicast path analysis in the Forward Search application. 

Search for multicast group 

To search for multicast groups, all you have to do is navigate to the Search application and type the multicast group IP in the search bar. 

Forward Search automatically recognizes the IP address as a Multicast group. 

Let’s dive into the info provided by the Forward platform for the given multicast group: 

The Search Results pane provides a Multicast group category where you can find information about the Rendezvous Point name and IP address, the VRF in which the group is configured, and direct links to the Rendezvous Point configuration and state for the given multicast group.

The Multicast Group Card in the middle pane provides an at-a-glance view of all the group members as well as direct links to the configuration and state for the given multicast group for each member. 

The Topology map on the right is context-aware and highlights the devices relevant to the content you are navigating, like the Rendezvous Point, the devices in the VRF where the RP is configured, and specific members. 

You can also search multicast groups by subnet. 

In this case, all the matching multicast groups are shown in the Search Results along with any RP information. At this point you can select one multicast group to get more information as in the previous example. 
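How an address can be recognized as a multicast group, and how groups can be matched against a subnet, is easy to illustrate with Python's standard `ipaddress` module. This is only a sketch of the idea with made-up group addresses; it says nothing about how Forward Search is implemented.

```python
import ipaddress

def is_multicast_group(text: str) -> bool:
    """True if the text parses as an IP address in a multicast range."""
    try:
        return ipaddress.ip_address(text).is_multicast
    except ValueError:
        return False  # not an IP address at all

def groups_in_subnet(groups, subnet):
    """Return the multicast groups that fall inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return [g for g in groups if ipaddress.ip_address(g) in net]

# Hypothetical groups known to the network.
known_groups = ["239.1.1.1", "239.1.1.20", "239.2.0.5"]
print(is_multicast_group("239.1.1.1"))
print(groups_in_subnet(known_groups, "239.1.1.0/24"))
```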


Multicast Path Analysis 

There are several IP Multicast protocols but the most commonly used are Protocol-Independent Multicast – Sparse Mode (or PIM-SM) at L3 and IGMPv3 at L2. 

With PIM-SM, multicast traffic flows on primarily two types of routes: (S,G), or s-comma-g, and (*,G), or star-comma-g.
Traffic on an (S,G) route flows from a specific Multicast Source to a Multicast Group, while traffic on a (*,G) route flows from any Multicast Source to the RP and from the RP to all the Multicast Clients of group G.
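The relationship between the two route types can be sketched as a table lookup: a source-specific (S,G) entry wins when one exists, and the shared (*,G) entry is the fallback. The table layout and interface names below are assumptions for illustration, not the data structure of any particular router or of Forward Enterprise.

```python
def lookup_mroute(mroutes, source, group):
    """Return the outgoing interfaces for traffic from `source` to `group`.

    `mroutes` maps (source, group) keys to lists of outgoing interfaces;
    the shared-tree entry uses "*" as its source.
    """
    if (source, group) in mroutes:
        return mroutes[(source, group)]  # source-specific (S,G) entry
    if ("*", group) in mroutes:
        return mroutes[("*", group)]     # shared (*,G) entry via the RP
    return []                            # no state for this group

# Hypothetical multicast routing table.
mroutes = {
    ("10.1.1.10", "239.1.1.1"): ["eth1", "eth2"],
    ("*", "239.1.1.1"): ["eth3"],
}
print(lookup_mroute(mroutes, "10.1.1.10", "239.1.1.1"))
print(lookup_mroute(mroutes, "10.9.9.9", "239.1.1.1"))
```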

Forward Enterprise provides PIM-SM Control Plane Modeling to compute complete (*, G) and (S, G) trees and show them in the Forward Enterprise topology along with the tree’s details. 

For instance, you can do a (*, G) path analysis to find out all the paths between the RP and the potential Receivers by issuing a query from the multicast group IP with Destination IP the multicast group IP: 

You can see all the different paths from the RP of the multicast group destined to the group members using the Paths selector and see hop by hop summary information as well as specific device state at Layer 2 and Layer 3. 

To perform an (S,G) path analysis, simply provide the multicast sender IP address and the multicast group IP address.

In this case, the path analysis not only provides all the paths between the Source and the Receivers at the data plane level, as in a common Path Analysis for unicast traffic, but also provides a Control Plane validation for each path.

If both indicators in the Control Plane validation section are green, it means that Forward Enterprise didn’t find any issue in the Control Plane, neither from the Sender to the RP nor from the Receivers to the RP, as you would expect. If either of the Control Plane validations is shown in red, it means Forward Enterprise has found some issues, and it provides insight that can help you find out whether connectivity exists from the sender to the RP and from the RP to the receiver.

It also shows you which interface would be selected if a receiver shows up, based on Reverse-Path Forwarding (RPF) for the given Sender, along with a brief explanation.
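For readers less familiar with RPF, the basic rule is that the RPF interface is the one the router would use to send unicast traffic back toward the source. A minimal sketch, assuming a simple prefix-to-interface route table and made-up addresses, uses a longest-prefix match:

```python
import ipaddress

def rpf_interface(unicast_routes, source):
    """Return the interface of the longest-prefix unicast route toward `source`."""
    src = ipaddress.ip_address(source)
    best = None
    for prefix, iface in unicast_routes.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None

def passes_rpf(unicast_routes, source, arrival_iface):
    """A multicast packet passes RPF only if it arrived on the RPF interface."""
    return rpf_interface(unicast_routes, source) == arrival_iface

# Hypothetical unicast routing table: prefix -> outgoing interface.
routes = {"0.0.0.0/0": "eth0", "10.1.0.0/16": "eth1", "10.1.1.0/24": "eth2"}
print(rpf_interface(routes, "10.1.1.10"))
print(passes_rpf(routes, "10.1.1.10", "eth1"))
```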

As usual, more capabilities will be added in the future to make it even easier to work with your Multicast network, so stay tuned and, in the meantime, watch the demo video above and happy multicasting with Forward Networks!!! 

Here we are with yet another blog on the Forward Network Query Engine (or NQE for short).

If you have been reading our previous blogs on this topic, you already know how passionate I am about NQE.

In my first blog, Query Your Network like a Database, I talked about how NQE helps to solve common challenges in network automation when it comes to retrieving network device configuration and state to verify the network posture, especially in large networks comprised of many different vendors and technologies and spread across on-prem and cloud deployments.

In a subsequent blog, In-App NQE Checks Examples Now Available on GitHub, I described how In-App NQE Checks help build network verification checks based on the NQE data entirely in the Forward Enterprise user interface, and I introduced a GitHub repository with some examples.

If you haven’t read those blogs and you are not familiar with NQE, well, you might want to do some reading on the topic before coming back here 🙂 

Still (or back) here? Great!

In this blog, I’m going to talk about a big improvement we have made in our latest release, the NQE Library.

Many of our customers are enthusiastic about In-App NQE Checks. They say it’s very easy to find violations of their network intent using the intuitive language, the built-in documentation, the data model schema based on OpenConfig, the provided examples, and so on.

As frequently happens, the more extensively customers use a product, the more use cases come up.

One of the use cases that came up from several NQE users has been:

“In some scenarios, we are not looking for violations [yet] but network insights instead.
Can we do that with NQE?”

Now you can with the NQE Library!

At a very high level, we have decoupled the NQE queries from the NQE Checks to enable the new use case (find network insights) while preserving the original use case (find network violations).

In a nutshell, the NQE Library allows you to easily create and organize collections of NQE queries.

The NQE Library workflow consists of the following steps:

  1. Create a query
  2. Commit it
  3. Use it in Verify (optional)

As shown in the NQE Library page below:

Fig 1. NQE Library Workflow

Create a query

To simplify the creation of NQE queries we have built the NQE Integrated Development Environment (IDE).

Fig 2. NQE Integrated Development Environment (IDE)

It consists of 4 different panes:

All the panes are collapsible and resizable to allow you to manage the screen space more efficiently.

If you are familiar with the IDE we built for the In-App NQE checks you will notice that the biggest difference is the introduction of the Files pane to organize the queries.

The easiest way to get started is by using one of the examples in the Help pane.

For instance, the first check can be used to find every interface whose admin status is UP but operational status is not UP.

The query iterates through all interfaces on all devices of any vendor and returns the device name, the interface name, the admin state, and the operational state. Finally, the violation field is set to true if the admin state is UP but the operational state is not UP for the given interface.
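The logic of that query can be mirrored in plain Python over a list of dictionaries. This is an analogue for readers who do not have the Help pane open, not NQE syntax, and the dictionary field names are assumptions chosen to match the description.

```python
def admin_up_oper_not_up(devices):
    """One row per interface, flagging admin UP with operational status not UP."""
    rows = []
    for device in devices:
        for iface in device["interfaces"]:
            rows.append({
                "device": device["name"],
                "interface": iface["name"],
                "adminStatus": iface["adminStatus"],
                "operStatus": iface["operStatus"],
                "violation": iface["adminStatus"] == "UP" and iface["operStatus"] != "UP",
            })
    return rows

# Hypothetical, already-normalized device data.
devices = [
    {"name": "core1", "interfaces": [
        {"name": "eth1", "adminStatus": "UP", "operStatus": "UP"},
        {"name": "eth2", "adminStatus": "UP", "operStatus": "DOWN"},
        {"name": "eth3", "adminStatus": "DOWN", "operStatus": "DOWN"},
    ]},
]
for row in admin_up_oper_not_up(devices):
    print(row)
```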

Fig 3. Edit query

Simply copy the example of your choice by clicking on the Copy query icon, paste it in the Editor pane and optionally click on Prettify to properly align all the lines in the query to make it more readable.

The NQE Editor supports many useful features, such as auto-completion, auto-save, and automatic error-fix suggestions based on the Data Model, among others.

When you are done editing a query, select the Network and Snapshot you want to run the NQE query against and click on Calculate to see the query result.

Fig 4. Query results

Commit it

All the changes made to a query are automatically saved but they are visible only to you in the NQE Library application.

You need to commit the NQE query to make it available to everybody in your organization and to use it as an NQE Verification Check.

Fig 5. Commit a query

Use It In Verify

A quick refresher on NQE Verification Checks: they are formulated as queries for violations.

If the query finds no violations, then the check passes. If the query finds violations, the check is marked as failed, and the identified violations form the failure diagnosis.
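The pass/fail rule is simple enough to state as code. A minimal sketch, assuming query results arrive as a list of rows with a boolean `violation` field (the row shape and the status strings are illustrative, not the product's actual output format):

```python
def evaluate_check(rows):
    """A check passes when no row is a violation; violations form the diagnosis."""
    violations = [row for row in rows if row.get("violation")]
    return {"status": "FAIL" if violations else "PASS", "diagnosis": violations}

# Hypothetical query output.
rows = [
    {"device": "core1", "interface": "eth1", "violation": False},
    {"device": "core1", "interface": "eth2", "violation": True},
]
print(evaluate_check(rows))
```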

By default, all the NQE queries published in the NQE library are disabled (inactive state) in the Verify NQE page.

To turn an NQE query into an NQE Verification Check, simply enable it by clicking on the toggle button on the left side of the query.

Fig 6. NQE Verification Checks in the Verify application

If an NQE Verification Check fails, you can see the violations as well as the query by clicking on the Failed status link.

Fig 7. NQE Checks result

In a networking world that is moving at a steady pace toward network automation and network-as-code, versioning of code, configuration, intent, etc. has become a prerequisite for adoption. The same concept applies to NQE queries. 

Every time an NQE query is modified and published, a new version of the query is made available in the Verify NQE application. You can select a specific version or always run the latest version available via the Version used option:

Fig 8. Query versioning
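The choice between pinning a version and tracking the latest one can be sketched as follows. The integer version numbers and the `"latest"` keyword are assumptions for illustration; the article does not describe how versions are identified internally.

```python
def resolve_query_version(versions, version_used="latest"):
    """Pick the query text to run: a pinned version or the newest one committed."""
    if not versions:
        raise ValueError("query has no committed versions")
    if version_used == "latest":
        return versions[max(versions)]
    return versions[version_used]

# Hypothetical commit history: version number -> query description.
history = {1: "interfaces admin UP / oper not UP", 2: "same, infrastructure links only"}
print(resolve_query_version(history))
print(resolve_query_version(history, version_used=1))
```

Pinning a version keeps a check stable while the query continues to evolve in the library.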

I always try to stay away from product roadmaps in customer-facing publications but…rest assured this is not the last time you are going to see a blog on NQE, so stay tuned!

In the meantime, check this demo video out and happy querying with the Forward NQE Library!

What do we do?

At Forward Networks we build digital twins for computer networks. Enterprises with large networks use our software to build a software copy of their network and use that for searching, verifying, and predicting behavior of their network. It is not a simulation. It is a mathematically accurate model of their network.

Why is it a hard problem?

Large enterprise networks contain thousands of devices (switches, routers, firewalls, load balancers, etc). Each of these devices can have complex behaviors. Now imagine a large graph with thousands of nodes where each node represents one of these devices and the links between nodes show how they are connected. You need to model exactly how traffic originating from edge devices is propagated through the network.
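To make the graph picture concrete, here is a toy sketch in Python. It models topology only: a breadth-first search finds one hop-by-hop path between two edge devices. It is not how Forward computes network behavior, since a real model also has to account for what each device does to each packet. The device names and links are invented for illustration.

```python
from collections import deque


def trace_path(links, src, dst):
    """Breadth-first search for one hop-by-hop path from src to dst."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # dst is unreachable from src


# Hypothetical five-device topology (directed links).
links = {
    "edge-a": ["core-1"],
    "core-1": ["core-2", "fw-1"],
    "core-2": ["edge-b"],
    "fw-1": ["edge-b"],
    "edge-b": [],
}

path = trace_path(links, "edge-a", "edge-b")
print(path)  # ['edge-a', 'core-1', 'core-2', 'edge-b']
```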

To do so, you need to understand the exact behavior of each device in handling different packets. A typical enterprise network not only includes different types of devices (routers, firewalls, etc.), but those devices are also built by different vendors (Cisco, Arista, Juniper, etc.), and even for the same device type from the same vendor, you typically see many different firmware versions. To build a mathematically accurate model, you need to model every corner case, and a lot of these are not even documented by vendors.

At Forward we have built an automated testing infrastructure for inferring the forwarding behavior of devices. We purchase or lease these devices, put them in our lab, inject various types of traffic into them, and observe how they behave.

Where are we today?

I’m proud to show off publicly today, for the first time, that we can process networks with more than 45,000 devices on a single box (e.g. a single EC2 instance). Here is a screenshot of an example network with about 45k devices:

Some of our customers send us their obfuscated data to help us identify performance bottlenecks and further improve performance. It is a win-win scenario: our software gets better over time, and they get even faster processing times. The data is fully obfuscated: every IP and MAC address is randomly changed to a different address, every name is converted to a random name, and these mappings are irreversible. These changes do not materially change the overall behavior of the model, and the obfuscated data is still representative of the complexity and diversity of network behaviors of the original network. The network in the above example is built from that data.
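As a rough illustration of random, consistent address replacement, here is a minimal Python sketch. It is not Forward's obfuscation code: the `Obfuscator` class is hypothetical, it handles IPv4 addresses only, and it ignores subnet relationships, which a real tool would presumably need to preserve so that the model still behaves the same.

```python
import ipaddress
import secrets


class Obfuscator:
    """Map each distinct IPv4 address to a random replacement, consistently."""

    def __init__(self):
        self._ip_map = {}   # original address -> replacement address
        self._used = set()  # replacements handed out so far

    def ip(self, addr):
        key = ipaddress.IPv4Address(addr)
        if key not in self._ip_map:
            # Draw random addresses until we find one not already handed out.
            while True:
                candidate = ipaddress.IPv4Address(secrets.randbits(32))
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._ip_map[key] = candidate
        return str(self._ip_map[key])


obfuscator = Obfuscator()
first = obfuscator.ip("10.0.0.1")
again = obfuscator.ip("10.0.0.1")   # same input, same replacement
other = obfuscator.ip("10.0.0.2")   # different input, different replacement
```

Because the replacements come from a random source rather than from a function of the original address, the mapping cannot be recomputed once the in-memory table is discarded. Only the obfuscated output leaves the customer's environment.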

This network includes more than 10^30 flows. Each flow shows how a group of similar packets traverses the network. For example, one flow might show how email traffic originating from a specific host and destined to another host starts from a data center, then goes through several backbone devices, and finally arrives at the destination data center.

Each of these flows can be complex. If we were to spend 1 microsecond to compute each of these flows, it would still take us more than 10^16 years to compute them all. But with a lot of hard engineering work, algorithmic optimizations, and performance optimizations, we are able to process this network in under an hour, and we can do so on a single box. You don’t need a massive cluster for such computation. The best part is that the majority of the computation scales linearly. So, if customers want faster processing speed or higher search and verification throughput, they can use our cluster version and scale based on their requirements.
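The brute-force estimate is easy to reproduce. The snippet below takes exactly 10^30 flows as a lower bound and 365.25 days per year; the result grows linearly with the actual flow count.

```python
FLOWS = 10**30                         # lower bound on the number of flows
SECONDS_PER_FLOW = 1e-6                # one microsecond per flow
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds

total_seconds = FLOWS * SECONDS_PER_FLOW      # about 1e24 seconds
years = total_seconds / SECONDS_PER_YEAR

print(f"{years:.2e} years")  # 3.17e+16 years
```

For comparison, that is a couple of million times the roughly 1.4 × 10^10-year age of the universe, which is why enumerating flows one by one is not an option.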

How long did it take us to get here?

Forward Networks was founded in July 2013. Our founders are Stanford PhD grads, and as a result the very first test data that we got was a 16-device collection from part of the Stanford network. I joined Forward in Sep 2014 after spending a couple of years building and working with large distributed systems at Facebook and Microsoft. I started leading the effort to scale our computation to be able to finish the computation of that 16-device network in a reasonable amount of time, and it took us about two years to get there (Mar 2015).

Then almost every year we were able to process a 10x larger network. Today, we have tested our software on a very complex network with 45k devices. We are currently working on further optimization and scaling efforts, and our projection is to get to 100k devices in Dec 2020. The following graph shows our progress over the last couple of months and the projection until Dec 2020 on a logarithmic scale:

Lessons learned

It takes time to build complex enterprise software

As I mentioned above, we started with data from a very small network. As we made our software better, faster and more scalable, we were able to go to customers with larger networks to get the next larger dataset, find the next set of bottlenecks, and work on those. We had to rewrite or significantly change the computation core of our software multiple times because as we got access to larger data we would see patterns that we hadn’t anticipated before.

Could we have reduced the time it took us to get here if we had had access to large data in the early days of our startup? Yes. Was it feasible? No. Why would a large enterprise spend the time to install our software, configure their security policies to allow our collector to connect to thousands of devices in their production network to pull their configs, and send the data to a tiny startup that doesn’t have a proven product yet? It is only going to happen if it is a win-win situation. Every time we got access to the next larger dataset from a customer, we optimized our software based on that and went to other customers that had networks of that size, where our software was already capable of processing all or the majority of their network. When they shared their data with us, we would either find new data patterns that needed to be optimized or combine all the data we had received from customers to build larger datasets for scale testing and improvements. It is a cycle, and it takes time and patience to build complex enterprise software.

Customers with large networks typically have much stricter security policies, which means that they wouldn’t share their data with us. This is why we had to spend the time to build data obfuscation capabilities into our software, allowing them to obfuscate their data and share the result with us. The obfuscated data reveals the performance bottlenecks without exposing their actual data. Some customers have such strict policies that even that is not possible, and for those we have built tools that aggregate overall statistics, which are typically useful for narrowing down the root cause of performance bottlenecks.

When selling enterprise software, customers typically don’t spend a large amount of money on a software platform if they’re not 100% sure that it would work for them. There is typically a trial or proof-of-concept period where they install the software and evaluate it in their environment. In our early years, we worked very hard with our first few trial customers to make the software work well for them. There were cases which didn’t end in an immediate purchase, but their data gave us invaluable insight into improving our software.

On-prem software should work on minimal hardware

These days it is pretty easy to provision an instance in AWS, Azure or other cloud providers with 1TB or more RAM. But you would be amazed to know how many times we have had to wait for weeks or months for some customers to provision a single on-prem instance with 128GB or 256GB RAM. Large enterprises typically allow provisioning small instances pretty quickly. But as soon as your software needs a more powerful instance, there can be a lot of bureaucracy to get it done. And remember, during the initial interactions with customers, you want them to start using your software quickly to finish the proof-of-concept period. During this time, they are still evaluating your software and they haven’t yet seen the value in their environment. So, if someone in a large organization opens a ticket with the infra teams to provision an instance for software they want to try, it may not be among the highest priority tickets to get resolved.

At Forward Networks, we have learned to be very careful with any new tool, framework or dependency we add to our system. In fact, our resource requirements are so low that our developers run the entire stack on their laptops, which is critical for fast debugging and quick iterations.

We have also spent a lot of engineering time and effort on making this possible. Here are some of the high-level approaches:

When you need to scale to 1000x or 10000x, you can’t simply use a cluster with 1000 nodes. Even if it is possible, there is no economic justification for that. You have to do the hard engineering work to get the same done with minimal resources. The majority of our customers run our software on a single box. But we also provide the cluster version for those customers that want to ensure high availability or have more concurrent users and want to have higher search or compute throughput.

One of our customers was telling us that they had to provision and operate a few racks of servers for another piece of software (in the same space as us but not exactly our competitor) and how they were pleased and amazed at what our software delivers with such low requirements. Of course, not only can this speed up adoption of the software, it also saves customers money and allows you as a software vendor to have better margins.

Open source tools are not always the answer

In the early years of our startup, we were using off-the-shelf platforms and tools like Elasticsearch and Apache Spark for various purposes. Over time it became clear that while these platforms are generic enough to be applicable to a wide range of applications, they weren’t a great fit when you need to have major customizations that are critical to your application.

For example, initially we were computing all end-to-end network behaviors and were indexing and storing them in Elasticsearch. But later it became clear that it is computationally infeasible to pre-compute all such behaviors to be able to store them in Elasticsearch, and even if it were possible, such an index would be enormous in size. We had to switch to a lazy computation approach where we would pre-compute just enough data to perform quick searches, and at search time we would do the rest of the computation that was specific to the user’s query.

Initially we were trying to write plugins or customizations for Elasticsearch to adapt it to such a lazy computation approach, but soon it became clear that it just wouldn’t work and we had to create our own homegrown distributed compute and search platform.
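The lazy computation idea described above can be sketched in Python as follows. This is a toy illustration, not our platform: the `LazySearchIndex` class and the per-device next-hop tables are hypothetical stand-ins for "just enough" pre-computed data. The end-to-end path is computed only when a query asks for it and is cached afterwards.

```python
class LazySearchIndex:
    """Store only per-device next hops; derive end-to-end paths on demand."""

    def __init__(self, next_hops):
        # Pre-computed part: {device: {destination: next device}}.
        self._next_hops = next_hops
        self._cache = {}
        self.computed = 0  # how many paths were actually computed

    def path(self, src, dst):
        key = (src, dst)
        if key not in self._cache:
            self.computed += 1
            self._cache[key] = self._walk(src, dst)
        return self._cache[key]

    def _walk(self, src, dst):
        path, node = [src], src
        while node != dst:
            node = self._next_hops.get(node, {}).get(dst)
            if node is None or node in path:
                return None  # no route, or a forwarding loop
            path.append(node)
        return path


# Hypothetical next-hop tables for a three-device network.
index = LazySearchIndex({
    "edge-a": {"edge-b": "core-1"},
    "core-1": {"edge-b": "edge-b"},
    "edge-b": {},
})

print(index.path("edge-a", "edge-b"))  # ['edge-a', 'core-1', 'edge-b']
print(index.path("edge-a", "edge-b"))  # served from the cache
print(index.computed)                  # 1
```

The stored data grows with the number of devices and destinations, while the number of possible end-to-end paths grows much faster. Only the paths that users actually search for are ever materialized.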

Moving fast without breaking things needs sophisticated testing

Every month we release one major release of our software. Currently, each of these releases includes about 900 changes (git commits); and this is just going to increase as we hire more engineers. At this rate of change, we have to have a lot of testing in place to make sure we don’t have regressions in our releases. 

Every git commit is required to be verified by Jenkins jobs that run thousands of unit and integration tests to ensure there are no regressions. Any test failure would prevent the change from getting merged. In addition to these tests, we also use Error Prone to detect common bugs and Checkstyle to enforce a consistent coding style.

We also have many periodic tests that run more expensive checks against the latest merged changes every few hours. These tests typically take a few hours to complete, and hence it is not feasible to run them on individual changes. Instead, when they detect issues, we use git bisect to identify the root causes. Not only do these periodic tests check for correctness, they also ensure there are no performance regressions. These tests upload their performance results to SignalFx, and we receive alerts on Slack and email if there are significant changes.
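git bisect works by binary search over the commit history, which is what makes it practical with tests that take hours. The Python sketch below shows the idea on a linear list of commits. It is a simplification, not git's implementation (git also handles branching history), and `is_bad` is a hypothetical stand-in for running the expensive test on a commit.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Made-up history: the regression was introduced in c5.
commits = [f"c{i}" for i in range(8)]
tested = []


def is_bad(commit):
    tested.append(commit)          # record which commits the test ran on
    return int(commit[1:]) >= 5


culprit = first_bad_commit(commits, is_bad)
print(culprit)  # c5
print(tested)   # ['c3', 'c5', 'c4']
```

With about 900 commits per release, this approach needs roughly ten test runs to pinpoint a regression instead of hundreds.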

Are we done?

While we believe we have already built a product that is a significant step forward in how networks are managed and operated, our journey is 1% complete. Our vision is to become the essential platform for the whole network experience, and we have just started in that direction. If this is something that interests you, please join us. We are hiring for key positions across several departments. Note that having prior networking experience is not a requirement for most of our software engineering positions.

If you operate a large-scale complex network, please request a demo to see how our software can de-risk your network operations and return massive business value.
