An in-depth look at the network in the Facebook Altoona data center

In this Q&A, Facebook describes the design, management and SDN control in its massive Altoona data center network.

Facebook recently offered a peek into the highly modular and scalable network it built for its new data center in Altoona, Iowa. The social networking giant has been very public with its approach to data center networking lately as it tries to generate a network equipment and software ecosystem around its Open Compute Project. Facebook's goal is to create a market of data center switches built with disaggregated hardware and software that offers lower costs and simplified operations. Facebook introduced a prototype switch -- Wedge -- as well as FBOSS -- a Linux-based network operating system -- for consideration in Open Compute's networking initiative.

While neither Wedge nor Open Compute plays an immediate role in the Facebook Altoona data center today, it's easy to envision how they will in the future. The Facebook Altoona network achieves its scalability and modularity through simplicity and a novel architectural approach. It uses just BGP and equal-cost multipath (ECMP) routing to maintain a simple, vendor-agnostic topology. The result is a massive leaf-and-spine network in which each layer is designed to scale independently of the others.

Facebook can scale out the leaf layer in the form of compute pods with top-of-rack (ToR) switches. The network has multiple, independent "planes" of spine switches that operate in parallel. It also scales external connectivity by introducing "edge pods" to the network design. Each edge pod contains racks of high-bandwidth switches that connect the spine planes to the external network. Facebook has designed the network so that it can add spine capacity, server access capacity or edge capacity independently as needed.

Furthermore, the Altoona network has a home-grown BGP controller that can override standard BGP routing when specific applications and services need a dedicated network path. The network has a massive orchestrator that defines network configuration abstractly and pushes configurations down to individual switches automatically. And finally, the company has developed a management platform that it trains to automatically diagnose and remediate network problems.

I wanted to learn more about how Facebook did all this in Altoona, so I spoke with Alexey Andreyev, one of the Facebook engineers who designed it.

Tell me about the design goals of the Facebook Altoona network.

Alexey Andreyev: We are trying to address the questions of being able to scale and to remove the limits of oversubscribed networks -- to provide capacity within the data center to whatever extent we require it, and to be able to scale in different dimensions in a very simple, uniform framework. When we need forwarding capacity, we add spine switches. When we need external connectivity, we add edge pods. This is a very structured and predictable framework of how we need to go toward infinite capacity targets. We want the network to be simpler for us to deal with, independent of increases in topology or complexity.

How have you achieved this simplicity operationally?

Andreyev: One key aspect is the software approach [we use] to create a fabric, [which] completes the automated management. We configure the data center; we don't configure individual components. And the component configurations are pretty much created from the high-level settings by our systems. We define what the data center would look like, what is the specific fabric topology form factor for it. All the necessary specifics such as individual device configurations, component configurations, port maps for how many devices need to be connected, addressing [and] routing policies -- all that is derived from high-level settings and interpreted into particular platform settings.

How do you do that interpretation?

Andreyev: We define a very simple translation into whatever applicable device. We look at overall network services and what is the desired routing state, what the desired topology [is], and the properties of the topology. From that we create fully abstracted, vendor-agnostic, logical views of each component. Then we translate that into a particular implementation. For example, a switch has X ports and the ports have X addresses, as opposed to having fixed routes. This is pretty universal. Then, when we generate this logical picture, we just interpret it into whatever the actual component is.
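The two-step translation Andreyev describes, from high-level settings to a vendor-agnostic logical view and then to platform-specific settings, might look roughly like the Python sketch below. This is an illustration only, not Facebook's tooling. The `FabricIntent` fields, the hostnames, the wiring rule (one uplink from each rack switch into each spine plane) and the two configuration dialects `alpha` and `beta` are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricIntent:
    """High-level, data-center-wide settings (all fields are hypothetical)."""
    name: str
    spine_planes: int
    spines_per_plane: int
    pods: int
    racks_per_pod: int


@dataclass(frozen=True)
class LogicalDevice:
    """Vendor-agnostic view of one component: its role and its uplinks."""
    hostname: str
    role: str
    uplinks: tuple


def build_logical_view(intent):
    """Derive every device and its port map from the high-level intent."""
    spines = {
        plane: [f"{intent.name}-plane{plane}-spine{s}"
                for s in range(intent.spines_per_plane)]
        for plane in range(intent.spine_planes)
    }
    devices = [LogicalDevice(n, "spine", ()) for names in spines.values() for n in names]
    for pod in range(intent.pods):
        for rack in range(intent.racks_per_pod):
            # Assumed wiring rule: one uplink per spine plane.
            uplinks = tuple(
                spines[plane][rack % intent.spines_per_plane]
                for plane in range(intent.spine_planes)
            )
            devices.append(LogicalDevice(f"{intent.name}-pod{pod}-tor{rack}", "tor", uplinks))
    return devices


def render(device, dialect):
    """Translate one logical device into a made-up platform syntax."""
    if dialect == "alpha":
        lines = [f"hostname {device.hostname}"]
        lines += [f"interface uplink{i} peer {peer}" for i, peer in enumerate(device.uplinks)]
    elif dialect == "beta":
        lines = [f"set system host-name {device.hostname}"]
        lines += [f"set link {i} neighbor {peer}" for i, peer in enumerate(device.uplinks)]
    else:
        raise ValueError(f"unknown dialect: {dialect}")
    return "\n".join(lines)


intent = FabricIntent(name="dc1", spine_planes=2, spines_per_plane=2, pods=1, racks_per_pod=2)
devices = build_logical_view(intent)
tor = next(d for d in devices if d.role == "tor")
print(render(tor, "alpha"))
print(render(tor, "beta"))
```

Only `render` knows anything about a platform, so supporting a new type of component in this sketch means adding one more branch there while the intent and the logical view stay unchanged.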

What are you using to communicate that high-level policy down to the switches?

Andreyev: We use whatever is applicable for specific components. We deploy the configuration that is prepared for these components in our repositories. We have mechanisms to schedule these operations on broad ranges of components. One of our philosophical goals is to keep it as simple as possible. Our automation and the forwarding plane are pretty much disaggregated, so we can replace any component as we need without the need to drastically change up software. And we can adapt new types of components very quickly.

Your vendor-agnostic approach aligns with the goals of a lot of data center operators, particularly members of the Open Networking User Group. How would this approach translate into the broader data center industry?

Andreyev: One key element is to keep it simple, demanding a minimum necessary amount of functionality for a component, and minimum necessary actions to deploy this functionality and operate the component. For example, for any component, we define an action [by gracefully taking] it out of service or gracefully [putting] it back in service, or to deploy the configuration and so on. [We find] the common denominators among many different types of things. And the simpler it is, the more commonalities there are. That's why we are talking about it. It's relatively simple, and we think everybody can benefit from it.
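The "common denominator" idea can be sketched as a small interface that every component type implements, so that automation is written once against the shared actions. The class and method names below (`Component`, `drain`, `undrain`, `deploy`, `safe_deploy`) are hypothetical and are not taken from Facebook's software; `RecordingSwitch` is a stand-in that only records what it was asked to do.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """The minimum set of actions every managed component must support."""

    def __init__(self, name):
        self.name = name
        self.in_service = True
        self.config = None

    @abstractmethod
    def drain(self):
        """Gracefully take the component out of service."""

    @abstractmethod
    def undrain(self):
        """Gracefully put the component back in service."""

    @abstractmethod
    def deploy(self, config):
        """Apply a prepared configuration."""


class RecordingSwitch(Component):
    """Stand-in implementation that just logs each action it performs."""

    def __init__(self, name):
        super().__init__(name)
        self.log = []

    def drain(self):
        self.in_service = False
        self.log.append("drain")

    def undrain(self):
        self.in_service = True
        self.log.append("undrain")

    def deploy(self, config):
        self.config = config
        self.log.append("deploy")


def safe_deploy(component, config):
    """Drain, deploy, then return to service, using only the common actions."""
    component.drain()
    component.deploy(config)
    component.undrain()


switch = RecordingSwitch("tor1")
safe_deploy(switch, "config-v2")
print(switch.log)
```

Because `safe_deploy` touches nothing beyond the three shared actions, it would work unchanged for any other component type that implements them.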

You mentioned that, aside from ECMP, BGP is the only routing protocol you are using in the Facebook Altoona data center. Is that an example of how you are keeping it simple?

Andreyev: Yes. Also we are using a minimum set of features to operate this topology. [BGP] is everywhere throughout the network, from the rack switch uplinks to the edge.
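For readers less familiar with ECMP, the general technique is that when BGP leaves a switch with several equally good next hops, the switch hashes fields of each flow to choose one, so packets of the same flow stay on the same path while different flows spread across all paths. The sketch below illustrates that idea in Python. It is not a description of Altoona's switches: real hardware uses its own hash functions and field selection, and SHA-256 and the device names here are stand-ins chosen for the example.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow 5-tuple."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for a hardware hash
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical spine names; the flow is (src IP, dst IP, protocol, src port, dst port).
spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.0.0.1", "10.0.1.9", 6, 49152, 443)
print(ecmp_next_hop(flow, spines))
```

Calling the function twice with the same flow always returns the same spine, which is what keeps a flow's packets in order.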

How does Altoona's BGP controller work?

Andreyev: Each of [the switches] has the ability to talk BGP with the controller. There are two basic components. There is an ability to see what BGP routing information the [switch] has, and the ability to inject routes to construct a path from device to device, end to end. We designed the routing in such a way that the majority of our flows, both BGP and ECMP, work very well. But if we need a custom path that is different from the BGP decisions, we can deploy it hop-by-hop using the controller functionality. And we can do it very quickly. What decides how we do it is pretty much a software decision.
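The two controller capabilities Andreyev names, seeing what each switch holds and injecting routes hop by hop, can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the Altoona controller: a real deployment would exchange actual BGP messages, whereas here an injected route is simply a table entry that takes precedence over the BGP-selected one. The switch names and the prefix are made up.

```python
class Switch:
    def __init__(self, name):
        self.name = name
        self.bgp_routes = {}  # prefix -> next hop selected by normal BGP
        self.injected = {}    # prefix -> next hop pushed by the controller

    def next_hop(self, prefix):
        # Assumption: an injected route wins over the BGP decision.
        return self.injected.get(prefix, self.bgp_routes.get(prefix))


class Controller:
    def __init__(self, switches):
        self.switches = {s.name: s for s in switches}

    def view(self, name):
        """Capability 1: see what routing information a switch holds."""
        s = self.switches[name]
        return {"bgp": dict(s.bgp_routes), "injected": dict(s.injected)}

    def inject_path(self, prefix, path):
        """Capability 2: build a custom path hop by hop, end to end."""
        for here, there in zip(path, path[1:]):
            self.switches[here].injected[prefix] = there

    def withdraw_path(self, prefix, path):
        """Remove the custom path so normal BGP decisions apply again."""
        for here in path[:-1]:
            self.switches[here].injected.pop(prefix, None)

    def trace(self, start, prefix, max_hops=16):
        """Follow next hops from a starting switch until no route remains."""
        path = [start]
        while len(path) <= max_hops:
            hop = self.switches[path[-1]].next_hop(prefix)
            if hop is None:
                return path
            path.append(hop)
        raise RuntimeError("forwarding loop detected")


tor, spine_a, spine_b, edge = (Switch(n) for n in ("tor1", "spineA", "spineB", "edge1"))
prefix = "203.0.113.0/24"
tor.bgp_routes[prefix] = "spineA"
spine_a.bgp_routes[prefix] = "edge1"
spine_b.bgp_routes[prefix] = "edge1"

controller = Controller([tor, spine_a, spine_b, edge])
default_path = controller.trace("tor1", prefix)
controller.inject_path(prefix, ["tor1", "spineB", "edge1"])
custom_path = controller.trace("tor1", prefix)
print(default_path, custom_path)
```

Withdrawing the injected path returns forwarding to the BGP-selected route, which matches the idea that overrides are exceptions layered on top of a design that works without them.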

Does the BGP controller make decisions automatically, or are those decisions made by engineers?

Andreyev: We can make it either. Because our fabric topology and all the configurations in the fabric are software-created, we have full insight into what the routing should be across the domains, and which components are supposed to have which software information. We have very quick and flexible ways to adjust specific routing paths by knowing everything. So we have the information of how it should be. We have controller insight into what each individual component has, and we have the controller ability to push specific routes to different points on the network to program whatever paths we need.

Would you describe this BGP controller as an SDN controller?

Andreyev: It's acting based on software decisions, so it fits into the description. Everything about the fabric is software-driven. There is no such thing as logging in to define and configure something. There is no such thing as creating configurations for individual devices. When we define the network, we work with configuration at a very high level for the whole network. Everything else is derived from software. We can define many ways to implement it.

The same thing goes for operations, monitoring, and troubleshooting, because most of the problems are resolved by robots. We rarely work with individual devices. When we add new devices, we have mechanisms for discovering the device's role in the topology and deploying configurations from a repository.

So you have built an automated troubleshooting platform for this network?

Andreyev: Yes. When we detect a problem, the system looks at the problem, identifies that it's a known problem, fixes it and notifies the operator that it fixed it. When new problems are discovered after fixing it, we look at how we make it auto-discoverable. So the next time it occurs, the same auto-remediation functionality comes along and fixes it. It's pretty much in line with how we've been managing servers. We've been using FBAR [Facebook Auto-Remediation, for server operations], and we're striving to make networking manageable by the same means and concepts as our servers. Whatever the problem is, or whatever specific set of actions is needed to remediate that problem, it's specific to that problem. It's a framework for generically detecting things and introducing actions to address those things -- and validating those actions and reporting on that.

It's not actually about the tools; it's about the concept. It's about what makes operations easier, [which] is the concept of having insight into issues by various means -- whether SNMP polling or sys-logging or something else -- and having the ability to react to those events and having the ability to add more remediation actions and notify people about the progress of these things. It's more about this overall framework of how we do it, and how we make such a large-scale and distributed network more manageable.
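The framework Andreyev outlines (detect, recognize a known problem, act, validate, report, and add new remediations over time) can be sketched generically. The Python below is a hypothetical illustration and is not FBAR. The event fields, the `bounce-port` remediation and the in-memory `state` dictionary are invented, and events are assumed to arrive already normalized from whatever source detected them, such as SNMP polling or syslog.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Remediation:
    name: str
    matches: Callable[[dict], bool]   # is this a problem we already know?
    fix: Callable[[dict], None]       # the action that addresses it
    validate: Callable[[dict], bool]  # did the action actually work?


class Remediator:
    def __init__(self, notify):
        self.known = []
        self.notify = notify

    def register(self, remediation):
        """Teach the system a new problem so the next occurrence is automatic."""
        self.known.append(remediation)

    def handle(self, event):
        for remediation in self.known:
            if remediation.matches(event):
                remediation.fix(event)
                ok = remediation.validate(event)
                outcome = "fixed" if ok else "fix failed"
                self.notify(f"{remediation.name}: {outcome} on {event['device']}")
                return ok
        self.notify(f"unknown problem on {event['device']}: needs a human")
        return False


# Toy device state standing in for the real network.
state = {"tor1": {"port3": "down"}}
messages = []
bot = Remediator(messages.append)

event = {"device": "tor1", "kind": "port_down", "port": "port3"}
first = bot.handle(event)  # not yet known, so it is escalated

bot.register(Remediation(
    name="bounce-port",
    matches=lambda e: e["kind"] == "port_down",
    fix=lambda e: state[e["device"]].__setitem__(e["port"], "up"),
    validate=lambda e: state[e["device"]][e["port"]] == "up",
))
second = bot.handle(event)  # now recognized, fixed, validated and reported
print(messages)
```

The first call escalates to a person because nothing matches. Once the remediation is registered, the same event is fixed, validated and reported without anyone touching the device.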

Join the conversation

1 comment

So there is not just BGP, there is also SNMP running and there are logging servers. The question I have is how do they manage to convert their SNMP traps into automated system and network changes? It seems they would also need to have a data dictionary that would have the scripts for each type of device and each administrative action that is defined as part of the solution set for each failure/error occurrence. How this was done is what would be very helpful to copying the approach. Can this be asked up the chain of communication?
Alston Davis, BAH Senior Lead Technologist