Data center interconnects, while commonplace, remain one of the more complex system design challenges facing network engineers. This complexity, however, does not stem from the networking side of the equation. Rather, the challenges stem from application requirements.
All too often, network engineers examine the technological side of the problem, rather than the problem's requirements, when trying to determine how best to proceed. Good network design principles suggest it's better to start by asking basic questions and carefully thinking through what is really needed.
In the network illustration below, a service needs to be moved from Host 1 to Host 2. Each of these hosts is connected to a different data center fabric. These two fabrics are connected through a link, provisioned by a service provider, between two routers connected to the fabric edge.
The first question to ask is this: Is an Ethernet, or Layer 2, extension really needed? Most of the complexity in data center interconnects (DCI) is a result of trying to make an extended physical environment look like a single Ethernet domain. But Ethernet was not designed to stretch across long geographic areas and interconnect thousands of hosts.
While some technologies allow these types of interconnections, it is important to consider how those tools should be implemented and where, rather than simply pushing complexity into the network. Just because something can be done does not mean it should be done.
Preserving IP addresses
The most common reason given for mobility at Layer 2 is to preserve the IP address for a service, or application, while it is being moved from one data center fabric to another. Service mobility, however, has three components.
The first is how clients find the service. Is it through a name or the IP address itself? If the service is discovered through its name, then it may be possible to use a mechanism such as dynamic DNS (DDNS) to allow the service to move without requiring the IP address to remain the same.
Domain name system (DNS) is often seen as very slow. But within a company or a data center, there is no reason DNS should be balky. DNS timers can be tuned, and load balancing and other capabilities can be engineered to permit quick failover between two copies of a service with two different IP addresses.
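One way to take advantage of name-based discovery is for the client to re-resolve the service name on every connection attempt rather than caching an address, so a low-TTL DNS record pointing at the standby instance is picked up quickly. A minimal sketch of that pattern, using only the standard library (the hostname and port here are hypothetical):

```python
import socket

def connect_by_name(hostname, port, timeout=2.0):
    """Resolve the service name on every attempt, then try each
    returned address in turn. If the primary instance is down, a
    low-TTL record pointing at the standby lets clients fail over
    without the service keeping its old IP address."""
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            hostname, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_error = err
    raise ConnectionError(f"no reachable instance of {hostname}") from last_error

# Hypothetical usage:
#   conn = connect_by_name("orders.internal.example", 8443)
```

How quickly clients notice a move is then governed by the TTL on the record, which is exactly the timer tuning described above.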
The second component is moving the IP address itself from one location in the network to another. The simplest way to do this is to connect the old and new physical devices to the same physical segment. While the lower-layer-to-IP mappings, such as ARP entries, will need to be relearned, everything else remains the same.
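That relearning step is typically driven by a gratuitous ARP: the moved host announces its own IP-to-MAC binding so neighbors on the segment refresh their tables. A sketch of what such a frame contains, built with the standard library (the MAC and IP values are hypothetical, and this only constructs the frame; actually sending it would require a raw socket and elevated privileges):

```python
import struct

def gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """Build a gratuitous ARP request: broadcast destination, with
    sender and target protocol address both set to the moved IP, so
    every host on the segment updates its ARP entry."""
    # Ethernet header: broadcast destination, our source MAC, ARP ethertype.
    eth = struct.pack("!6s6sH", b"\xff" * 6, mac, 0x0806)
    arp = struct.pack("!HHBBH6s4s6s4s",
                      1,        # hardware type: Ethernet
                      0x0800,   # protocol type: IPv4
                      6, 4,     # hardware / protocol address lengths
                      1,        # opcode: request
                      mac, ip,          # sender MAC and IP
                      b"\x00" * 6, ip)  # target MAC unknown, target IP = own IP
    return eth + arp

frame = gratuitous_arp(b"\x02\x00\x00\x00\x00\x01", bytes([10, 0, 0, 5]))
```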
The dangers of stretched Layer 2
In the past, applications were designed with the assumption that different parts of the application were directly connected via an Ethernet segment. Applications should no longer be written this way, although they often still are. It is tempting to throw complexity at the network forever to solve this problem, but this ignores the very real tradeoffs involved in doing so.
Adding stretched Layer 2 makes the network more convoluted and, as a result, more fragile. Stretched Layer 2 can drive a higher rate of network failures, ultimately costing the enterprise more in lost availability than it would have cost to think through how to make the application work better.
Another consideration is Layer 3. How might the application's inability to work at Layer 3 affect its performance across the network? If the application's developers designed it to only work across a short-run local link, assumptions may have been made about timers, flow control and other factors.
As a result, the application might appear to work fine across a geographically stretched segment, yet be running close to the edge of some performance limit that will ultimately cause a failure. There is no real way to know the difference.
The third component of service mobility is discovery. How do different instances of the service find one another? Even if some form of Layer 2 mobility is indicated, it might be possible to break up a single logical Layer 2 segment into multiple broadcast domains. This not only simplifies the design of the network, but also reduces the size of various failure domains, increasing the mean time between failures.
Again, applications should no longer use Layer 2 discovery mechanisms of this type; if they do, then you need to be very careful about the design of the application.
Security and telemetry concerns
On top of these components, other DCI challenges lurk, especially around security and telemetry.
There seems to be a common assumption that if traffic is being carried across a virtual circuit provisioned by a provider, it is secure. This is a bad assumption. Tunneling adds a new header, but the new header is not an armor coating preventing attackers from intercepting or examining the flow. Security must be dealt with explicitly, especially when dealing with flows originally designed to be transmitted across a single, short segment contained within a single physical facility.
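The point is easy to demonstrate: encapsulation changes where a packet goes, not what it carries. A toy illustration below wraps application data behind a minimal GRE-style header (a mock, not a full GRE implementation); the payload remains readable to anyone who can capture the circuit:

```python
import struct

payload = b"GET /accounts?user=alice HTTP/1.1"  # hypothetical application data

# Prepend a minimal GRE-style header: flags/version = 0, protocol
# type 0x0800 (IPv4). The tunnel adds routing context, not secrecy.
tunneled = struct.pack("!HH", 0, 0x0800) + payload

# The original bytes are still present, verbatim, inside the tunnel.
assert payload in tunneled
```

Confidentiality has to come from an explicit mechanism such as IPsec or TLS, not from the encapsulation itself.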
Telemetry, meanwhile, is essential. How will you know what kind of delay and jitter you may experience across the stretched link? How will you know when, where and how many packets are dropped? How can you determine if packets are being delivered out of order? Telemetry and management are critical components of inter-DC connectivity that need to be addressed before there are problems.
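The accounting behind those questions is simple once probes carry sequence numbers and timestamps. A minimal sketch, assuming sequence-numbered probes with recorded send and receive times (the data structures here are hypothetical); jitter is smoothed with the RFC 3550 estimator:

```python
def probe_stats(sent, received):
    """Compute loss, reordering and smoothed jitter from probes.

    sent:     dict mapping sequence number -> send timestamp
    received: list of (sequence number, receive timestamp) in arrival order
    """
    lost = len(sent) - len(received)
    # A probe arriving with a lower sequence number than its
    # predecessor was delivered out of order.
    reordered = sum(
        1 for (a, _), (b, _) in zip(received, received[1:]) if b < a)
    jitter, prev_transit = 0.0, None
    for seq, recv_time in received:
        transit = recv_time - sent[seq]
        if prev_transit is not None:
            # RFC 3550 smoothed jitter: J += (|D| - J) / 16
            jitter += (abs(transit - prev_transit) - jitter) / 16
        prev_transit = transit
    return {"lost": lost, "reordered": reordered, "jitter": jitter}
```

Running this continuously across the stretched link gives a baseline, so a drift in jitter or reordering is visible before the application fails.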
Data center interconnects are often considered a solved problem. Yet, it's valuable to re-examine the issues you may be facing with your DCI technology to determine, carefully, whether adding complexity is the best approach to take.