If you keep up with data center design, you certainly know that most large-scale -- hyperscale or web-scale -- fabrics use the Border Gateway Protocol, or BGP, as their primary routing protocol; for instance, see RFC 7938 and LinkedIn's description of its data center fabric. If you follow the hype cycle, you are probably certain that many of these same network infrastructure operators are moving toward a pure software-defined networking, or SDN, play in those same fabrics.
Or are they? Four recent drafts published in the Internet Engineering Task Force (IETF) imply another movement is afoot in the data center fabric world: using some form of modified link-state protocols on large-scale data center network architectures.
- Routing in Fat Trees describes a number of modifications to the Intermediate System to Intermediate System (IS-IS) routing protocol, specifically modifications that shift the operation of IS-IS somewhat closer to a distance-vector protocol and send only a minimal amount of information to the leaf -- or top-of-rack (ToR) -- routers in a large-scale fabric.
- IS-IS Routing for Spine-Leaf Topology describes a set of modifications to the IS-IS protocol so a router can advertise that it is connected to hosts, or servers -- in other words, that the advertising router is a ToR switch in a data center fabric. The draft then provides for the advertisement of minimal information to ToR switches -- specifically a minimal amount of topology information and a default route.
- Openfabric discusses a set of new mechanisms through which a router in a five-stage, or larger, spine-and-leaf fabric can determine its location within the fabric to help auto-configuration, as well as a set of flooding optimizations that allow IS-IS to reduce the flooding load in a large-scale, highly meshed topology -- specifically a spine-and-leaf topology.
- BGP-Shortest Path First describes a system that uses link-state information carried in BGP to calculate the best paths through a data center fabric, rather than using the standard BGP best-path algorithm.
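A common thread in the first two drafts is sending leaf routers only minimal routing information. The toy model below sketches that idea; the topology, names and prefixes are invented for illustration and do not come from the drafts themselves:

```python
# Toy illustration (invented topology and prefixes): in a two-stage
# spine-and-leaf fabric, a spine needs a route toward every leaf prefix,
# while a leaf can get by with its local prefix plus a default route
# load-shared across the spines -- the "minimal information" the drafts
# describe advertising toward ToR routers.

spines = ["spine1", "spine2"]
leaves = {"leaf1": "10.0.1.0/24", "leaf2": "10.0.2.0/24", "leaf3": "10.0.3.0/24"}

def spine_rib():
    # A spine carries one entry per leaf prefix, next hop the leaf itself.
    return {prefix: leaf for leaf, prefix in leaves.items()}

def leaf_rib(leaf):
    # A leaf carries only its local prefix and a default route via all
    # spines; its table size does not grow with the number of leaves.
    return {leaves[leaf]: "local", "0.0.0.0/0": spines}

print(len(spine_rib()))        # grows with the number of leaf prefixes
print(len(leaf_rib("leaf1")))  # stays constant as the fabric grows
```

The payoff is that ToR routers -- the most numerous and usually least powerful devices in the fabric -- hold state that stays flat as the fabric scales out.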
Why the focus on link-state protocols?
Why the sudden interest in something other than BGP and SDN options within the IETF? The primary reason seems to be that hyperscale and web-scale operators have determined that pure-play SDN is not the only, or perhaps the best, path forward in these large-scale environments. None of these operators seems intent on abandoning SDN technologies; instead, they seem intent on combining SDN with more traditional approaches to build a sort of hybrid -- perhaps best called a programmable network infrastructure. The idea is to combine a simplified distributed routing protocol in parallel with an SDN-type design to solve two very different kinds of problems.
Specifically, in each of these cases, some form of distributed protocol is being proposed. The primary advantage of a link-state routing protocol has been its ability to build a complete view of the network topology. While controller-based SDN can build a full view of the network topology, the process is less dynamic, and it can face serious scaling and convergence speed issues. Using link-state protocols to discover the base topology takes advantage of existing protocols and experience to solve one part of the control-plane puzzle.
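The "complete view of the topology" a link-state protocol builds is just a database of nodes and links, over which each router runs a shortest path first (SPF) computation. A minimal sketch, using Dijkstra's algorithm over an invented two-spine, two-leaf link-state database:

```python
import heapq

# Invented link-state database: node -> {neighbor: link cost}.
# Every router floods its adjacencies, so every router holds this
# same view and can compute shortest paths from its own position.
lsdb = {
    "leaf1":  {"spine1": 1, "spine2": 1},
    "leaf2":  {"spine1": 1, "spine2": 1},
    "spine1": {"leaf1": 1, "leaf2": 1},
    "spine2": {"leaf1": 1, "leaf2": 1},
}

def spf(root):
    # Standard Dijkstra SPF: returns the cost of the shortest path
    # from root to every other node in the link-state database.
    dist = {root: 0}
    pq = [(0, root)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, cost in lsdb[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return dist

print(spf("leaf1"))  # {'leaf1': 0, 'spine1': 1, 'spine2': 1, 'leaf2': 2}
```

Because every router computes over the same database, the result is loop-free without the hop-by-hop policy machinery a path-vector protocol carries.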
Why not native BGP, without extensions for carrying link-state information? While BGP has proven to work well in many large-scale data center network architectures, BGP is a path-vector protocol that does not provide information about the fabric topology natively -- hence the proposed extensions for carrying link-state information in BGP, as well as the proposal to actually use this information to build a set of shortest paths through the network.
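The difference in what the two protocol families can "see" can be made concrete with a contrived example (all names and values invented): a path-vector speaker learns only AS paths toward each prefix, so details such as parallel links collapse into identical paths, while a link-state database records each adjacency individually -- which is roughly the information the BGP-SPF proposal adds to BGP.

```python
# What a BGP-only leaf learns for a remote prefix: one AS path per
# session. Two parallel links toward the same spine produce the same
# AS path twice, so the topology detail is lost.
bgp_paths = {"10.0.2.0/24": [("spine_as", "leaf2_as"), ("spine_as", "leaf2_as")]}

# What a link-state database records: every individual adjacency,
# including both parallel links and their costs.
lsdb_links = [
    ("leaf1", "spine1", 1),
    ("leaf1", "spine1", 1),  # second parallel link remains visible
    ("spine1", "leaf2", 1),
]

# The duplicate AS paths carry no extra topology information...
print(len(set(bgp_paths["10.0.2.0/24"])))  # 1
# ...while the link-state view preserves all three adjacencies.
print(len(lsdb_links))  # 3
```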
Further, some engineers believe BGP is very heavyweight; it is a protocol designed to interconnect multiple operators, with a focus on policy rather than simple reachability. BGP implementations tend to run to hundreds of thousands of lines of code, which means they tend to be very complex. Even in data center fabrics that do not use many of these features, all of that code is still running on the fabric routers.
Three of the proposals listed above are based on IS-IS, a link-state protocol that is not widely used in data centers today. IS-IS, however, tends to be very scalable without modifications, and it also tends to work well with lightweight implementations. IS-IS runs natively over Ethernet, reducing the complexity of configuration by removing the need for even link-local addressing on fabric transit links.
Are these drafts the beginning of movement away from BGP in large-scale data center fabrics? They seem to be more of a move toward simplification of the control plane. It is likely that one protocol really cannot solve every problem in large-scale data centers; the movement now is toward splitting out functionality among several simpler protocols.
For those outside the hyperscale and web-scale spaces, these drafts are worth watching, as they might be pointing the way to the next stage of large-scale data center control planes -- perhaps even the real result of the work that has gone into SDNs.