Active path testing really works, as exemplified by Facebook.
I was just catching up on some reading and found an interesting blog post from February 2016 about Facebook's active path testing tool, NetNORAD. I recommend reading it to learn what drove Facebook to build its own network performance monitoring tool. NetNORAD is based on a pinger and responder system that measures packet loss and latency between servers. A related tool, fbtracert, is used for detailed troubleshooting and fault location when the packet stats identify a problem. Both tools are published on GitHub as open source projects. The system also relies on Scribe, a message logging system, and Scuba, a query and reporting system.
Before you say, "Oh, great. Another tool that sounds great, but is created by a fabulous team of software engineers at Facebook, and there is no way we could do this where I work," just bear with me.
Even if you don't work for a company with the resources of Facebook, you can take advantage of similar systems to provide equivalent functionality that works for your organization. Let's look at what Facebook did.
Key design decisions behind the tool
There were several key design decisions in the creation of the NetNORAD active path testing network monitoring tool. Among them:
- Ping between servers. An end-to-end ping tests the servers, their connections to the network and the network itself. I see Facebook's approach as applying the end-to-end principles that Jerome Saltzer, David Reed and David Clark set out in their 1984 paper, "End-to-End Arguments in System Design," and revisited in the 1998 follow-up, "Active Networking and End-to-End Arguments."
- Pick two pingers and two responders in each rack. Having two of each type provides redundancy and allows identification of server-level and rack-level problems, as well as higher-level problems.
- Use the User Datagram Protocol (UDP). Facebook makes a very convincing argument for using UDP instead of Transmission Control Protocol and the Internet Control Message Protocol. Among other factors, UDP is simpler and it allows for direct measurement of underlying packet loss.
- Collect the data using a tool that scales as needed. In this case, Facebook's engineers built Scribe because existing tools wouldn't scale. Most enterprises don't need the same scale and can use other data collection tools.
- Store the data in a nonrelational database. Facebook encountered problems with relational database technology and wisely implemented something that would scale better. I keep seeing network management vendors whose developers feel that a relational database management system is the right approach for storing time-series data. However, the volume of network management data in a large -- not huge -- enterprise is a problem for a different type of database. At Facebook's scale, it also had to use an in-memory database to achieve the desired level of performance.
- Provide the basic visualization and reporting tools. The basic set of tools meets the needs of most of its user community, avoiding the need for a lot of custom development.
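To make the pinger and responder idea concrete, here is a minimal sketch of UDP-based probing in Python. It is not NetNORAD's code: the probe format (a sequence number plus a send timestamp), the function names and the timeout values are my own assumptions for illustration, and the demo runs over the loopback interface rather than between racks.

```python
import socket
import struct
import threading
import time

# Hypothetical probe layout: 4-byte sequence number, 8-byte send timestamp.
PROBE = struct.Struct("!Id")


def run_responder(sock, stop):
    """Echo each datagram back to its sender until `stop` is set."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def probe(target, count=10, timeout=0.5):
    """Send `count` UDP probes to `target` and summarize loss and RTT."""
    rtts_ms = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            sock.sendto(PROBE.pack(seq, time.monotonic()), target)
            try:
                data, _ = sock.recvfrom(2048)
            except (socket.timeout, ConnectionError):
                continue  # no reply in time: count the probe as lost
            if len(data) != PROBE.size:
                continue  # not one of our probes
            echoed_seq, sent_at = PROBE.unpack(data)
            if echoed_seq == seq:  # a late reply to an older probe is ignored
                rtts_ms.append((time.monotonic() - sent_at) * 1000)
    received = len(rtts_ms)
    return {
        "sent": count,
        "received": received,
        "loss_pct": 100.0 * (count - received) / count,
        "avg_rtt_ms": sum(rtts_ms) / received if received else None,
    }


if __name__ == "__main__":
    # Loopback demo: one responder thread, one pinger.
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=run_responder, args=(responder, stop))
    worker.start()
    print(probe(responder.getsockname(), count=5))
    stop.set()
    worker.join()
    responder.close()
```

In a real deployment the responder would run on designated servers in each rack and the pinger's results would be shipped to a collection system. The sketch only shows why UDP makes loss directly visible: every probe that is not echoed back is counted, with no retransmission hiding it.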
The advantages of active path testing
I've included active path testing in my network management architecture recommendations for many years. The end-to-end visibility is very valuable. It is like having a set of active testers checking and reporting on network performance from the view of someone who is using the network. Think of it as monitoring the heartbeat of the network, like the pulsing of blood in arteries and veins.
Being able to identify paths that are behaving poorly is a big win. Once a path problem has been identified, it is easy to check Simple Network Management Protocol data for the relatively few interfaces on that path to determine if the cause can be easily identified. As with NetNORAD, it is important to be able to send UDP packets and to be able to add quality-of-service (QoS) markings.
[Editor's note: If you need QoS-specific measurements, then remember to configure the network interfaces of the pingers and responders to allow the QoS markings.]
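As a rough illustration of what adding a QoS marking to a probe looks like, the sketch below sets a DSCP value on a UDP socket in Python. The helper names are hypothetical, DSCP 46 (Expedited Forwarding) is just an example class, and `socket.IP_TOS` is assumed to be available, which is the case on Linux and most Unix-like systems but not everywhere. Whether the marking survives depends on how the hosts and network devices along the path are configured.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding, used here only as an example class


def dscp_to_tos(dscp):
    """Convert a 6-bit DSCP value to the 8-bit TOS byte (DSCP in the top six bits)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2


def open_marked_udp_socket(dscp):
    """Return a UDP socket whose outgoing IPv4 packets carry the given DSCP value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

A pinger could open one such socket per traffic class and compare loss and latency across classes to see whether a path treats them differently.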
Several vendors provide enterprise-level products that provide similar functionality, though on a smaller scale. For all but a few enterprises, it is much more cost-effective to purchase one of these products than it is to invest the staff time to construct a system from open source projects. Take a look at, among others, AppNeta, NetBeez and NetScout (TruView Live) for products. [Editor's note: This isn't a comprehensive list.]
These vendors typically have both hardware and software versions of their probes, allowing installation on servers and endpoints, as well as stand-alone installations. The probes are largely self-managing, automatically downloading updates as needed. Using a combination of hardware and software allows identification of network problems, as well as server-side problems. For example, if probes to a slow server's subnet show no packet loss, no elevated latency and no high jitter -- changes in latency -- then look at the server's link, the server's internal functions and application dependencies on other servers.
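That triage rule can be sketched in a few lines of Python. The function names and the thresholds (1% loss, 50 ms latency, 5 ms jitter) are assumptions for illustration, not values from any vendor's product, and jitter is computed here simply as the mean absolute change between consecutive round-trip times.

```python
from statistics import mean


def summarize(samples_ms):
    """Summarize probe results; each sample is an RTT in ms, or None if lost."""
    received = [s for s in samples_ms if s is not None]
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "loss_pct": 100.0 * (len(samples_ms) - len(received)) / len(samples_ms),
        "latency_ms": mean(received) if received else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


def triage(samples_ms, max_loss_pct=1.0, max_latency_ms=50.0, max_jitter_ms=5.0):
    """Suggest where to look first when a server behind this path seems slow."""
    stats = summarize(samples_ms)
    path_is_bad = (
        stats["loss_pct"] > max_loss_pct
        or stats["latency_ms"] is None  # every probe was lost
        or stats["latency_ms"] > max_latency_ms
        or stats["jitter_ms"] > max_jitter_ms
    )
    return "network path" if path_is_bad else "server or application"


# A clean path points away from the network.
print(triage([10.0, 11.0, 10.0, 11.0]))
# Lost probes and a latency spike point at the path.
print(triage([10.0, None, 80.0, None]))
```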
The vendors work hard to make sure their systems scale up to handle the needs of most customers. That also means those of us outside of Facebook don't have to do that work.
Most of the active path testing tools can also fetch a webpage, which provides a basic application-level ping for web-based applications. This capability is great for monitoring cloud-based services that use a web interface. Check with the provider for the ability to perform other -- non-web -- application-level pings.
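A web-based application-level ping amounts to timing an HTTP request and checking the response status. Here is a minimal sketch using only Python's standard library; the function name, the result fields and the rule that any 2xx or 3xx status counts as healthy are my assumptions, not any vendor's definition.

```python
import time
import urllib.error
import urllib.request


def http_ping(url, timeout=5.0):
    """Fetch `url` once and report the status code and elapsed time in ms."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the body transfer in the measurement
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar
        return {"ok": False, "status": None, "elapsed_ms": None}
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"ok": 200 <= status < 400, "status": status, "elapsed_ms": elapsed_ms}
```

Running a check like this on a schedule from several probe locations, and recording the results alongside the UDP path measurements, helps separate "the network is slow" from "the application is slow."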
Learning from Facebook
We don't have to work at Facebook to benefit from what it has learned. Few networks need to operate at Facebook's scale. Instead, look at what the active path testing vendors are doing, identify the features that best match your monitoring requirements and start a proof of concept. A reasonably good system can be built with a few probes at a reasonable cost.