There has been lots of hype around using the Transparent Interconnection of Lots of Links (TRILL) protocol for bridging and multi-pathing in data center networks, but a data center architecture with TRILL as the foundation ignores important emerging trends, including the need for varying types of virtual LANs (VLANs) for segmentation in cloud networks.
VLANs in data center network architecture
Virtual LANs (VLANs) have many applications and benefits but are most commonly used in the enterprise data center as a segmentation mechanism. They can be used to segment tiers of a single application or different applications and business units to provide failure isolation. They can enable administrative separation, secure the infrastructure and also comply with regulations. Today VLANs are also being adopted in the cloud to separate users. It is becoming increasingly clear that over the next decade, any architecture used for private and public clouds will need to allow for seamless and controlled communication within and across VLANs.
Because of their importance in the data center, it’s a good idea to look at the key scale properties of VLANs. Here are a VLAN’s three orthogonal scaling properties:
1. Size of a VLAN: This relates to the number of VLAN members. A large VLAN is useful to build a Layer 2 cluster for single applications, such as HPC or search, where the application is owned by a single entity (single-tier trusted application).
2. Number of VLANs: Standard Ethernet (802.1Q) uses a 12-bit VLAN ID, providing for roughly 4K VLANs. If VLANs are also used to separate individual tenants or users, the number of VLANs required will far exceed 4K.
3. Stretch of a VLAN: This governs the locality (or lack thereof) of the logical interfaces that are part of a VLAN. With virtualization, when an application has been separated from a physical device, it can be instantiated anywhere. However, the application still needs to retain its IP address and the associated policies (VLANs are an effective container for specifying policies).
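To make the orthogonality concrete, here is a minimal Python sketch modeling the three properties as independent dimensions. All class, host and switch names are illustrative, not from any vendor's API:

```python
from dataclasses import dataclass, field

# Hypothetical model of a VLAN's three orthogonal scaling properties.
# It only illustrates that size, number and stretch can each grow
# independently of the others.

VLAN_ID_BITS = 12                      # 802.1Q VLAN tag width
MAX_VLANS = 2 ** VLAN_ID_BITS - 2      # IDs 0 and 4095 are reserved -> 4094 usable

@dataclass
class Vlan:
    vlan_id: int
    members: set = field(default_factory=set)   # property 1: size of the VLAN
    switches: set = field(default_factory=set)  # property 3: stretch (physical locality)

    def add_member(self, host: str, switch: str) -> None:
        self.members.add(host)
        self.switches.add(switch)

class Fabric:
    def __init__(self):
        self.vlans: dict[int, Vlan] = {}        # property 2: number of VLANs

    def create_vlan(self, vlan_id: int) -> Vlan:
        if not 1 <= vlan_id <= MAX_VLANS:
            raise ValueError(f"VLAN ID must be 1..{MAX_VLANS}")
        return self.vlans.setdefault(vlan_id, Vlan(vlan_id))

fabric = Fabric()
v = fabric.create_vlan(100)
v.add_member("vm-a", switch="tor-1")
v.add_member("vm-b", switch="tor-7")   # same VLAN stretched across racks
```

Growing the member set, the VLAN count, or the switch set each touches a different structure, which is the sense in which the three properties scale orthogonally.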
[Figure: the three orthogonal scaling properties of a VLAN — size, number and stretch]
Automated VLANs for virtualization
Using available technology, it is possible to treat these three properties as independent attributes and scale them orthogonally. Additionally, the provisioning of VLANs can be automated and made to be in lock-step with the instantiation of VMs. This coordination between the Virtual Machine Manager and the Fabric Manager can be realized either in the management plane or in the control plane through in-band signaling protocols like VDP (part of the 802.1Qbg standard).
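As a hedged sketch of the management-plane variant of this coordination, with entirely hypothetical class and method names (a real deployment would use the Virtual Machine Manager's APIs or in-band VDP, not this toy interface):

```python
# Hypothetical sketch: when a VM is instantiated or migrated, its VLAN is
# provisioned on the attaching switch port in lock-step. Nothing here is
# a vendor API; in-band VDP (802.1Qbg) would achieve the same effect in
# the control plane.

class FabricManager:
    def __init__(self):
        # (switch, port) -> set of VLAN IDs provisioned on that port
        self.port_vlans: dict[tuple[str, str], set[int]] = {}

    def provision(self, switch: str, port: str, vlan_id: int) -> None:
        self.port_vlans.setdefault((switch, port), set()).add(vlan_id)

    def deprovision(self, switch: str, port: str, vlan_id: int) -> None:
        self.port_vlans.get((switch, port), set()).discard(vlan_id)

class VmManager:
    def __init__(self, fabric: FabricManager):
        self.fabric = fabric

    def start_vm(self, vm: str, vlan_id: int, switch: str, port: str) -> None:
        # The VLAN appears on the port before the VM's vNIC goes live.
        self.fabric.provision(switch, port, vlan_id)

    def migrate_vm(self, vm: str, vlan_id: int, old: tuple, new: tuple) -> None:
        self.fabric.provision(*new, vlan_id)    # pre-provision at the destination
        self.fabric.deprovision(*old, vlan_id)  # clean up the source port

fabric = FabricManager()
vmm = VmManager(fabric)
vmm.start_vm("web-1", vlan_id=200, switch="tor-1", port="ge-0/0/5")
vmm.migrate_vm("web-1", 200, old=("tor-1", "ge-0/0/5"), new=("tor-4", "ge-0/0/2"))
```

The point of the sketch is the ordering: the network state follows the VM automatically, so the VLAN's stretch tracks workload placement without manual switch configuration.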
TRILL in data center network architecture: The tangled love affair
Data center network architectures and protocols must take a holistic view of the data center environment, but TRILL has a very narrow scope: solving problems related to legacy Layer 2 deployments.
TRILL was originally designed as an enhancement to the Spanning Tree Protocol (STP) in distributed campus networks. Compared to STP networks, TRILL networks have enhanced forwarding at Layer 2, enabling the use of multiple paths between endpoints and thereby allowing for better bandwidth utilization in the network. However, the scope of TRILL (perhaps in part due to its campus pedigree) is limited to Layer 2, which makes it problematic as a solution for enterprise and cloud data center networks.
IP forwarding (which is needed when traffic has to flow between VLANs) is at best an afterthought in the TRILL architecture. It is handled as a one-armed service, with the service attachment point quickly becoming a bottleneck. Data center network architects using TRILL thus face a dilemma: in order to get the Layer 2 multi-pathing benefits of TRILL, they have to drastically reduce the number of VLANs in use, as crossing VLANs is expensive in TRILL networks. This, in turn, compromises the critical segmentation benefits of VLANs in data centers. Put simply, TRILL allows for a large physical stretch of a VLAN, but that comes at the expense of being forced to use large VLANs (those with many members). While the VLAN boundary problem is perhaps TRILL's Achilles' heel, it has several other shortcomings that should give network architects cause for reflection:
1. Economics: Conceptually, TRILL core switches are supposed to provide simple transport functionality; however, they remain complex because TRILL has a narrow focus on the Layer 2 unicast problem. This means that a TRILL architecture will have multiple tiers where complex processing (Layer 3, multicast, FCoE, congestion management, etc.) is required. Consequently, this can result in overall poor economics for the data center architecture.
2. Security and failure isolation: Security and failure isolation are real concerns in the TRILL architecture. Both issues stem from being artificially forced into large broadcast domains. Flapping interfaces, misbehaving or malicious applications and configuration errors can cause widespread damage and, in a worst-case scenario, result in a data center meltdown.
3. Layer 3 multi-pathing: While TRILL solves multi-pathing for Layer 2, it breaks multi-pathing for Layer 3. With Virtual Router Redundancy Protocol (VRRP), only one router in the group is active as the default gateway at any time, which means that there is no multi-pathing capability at Layer 3.
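A toy illustration of the contrast (not a protocol implementation; the hash function and router names are assumptions for the example): ECMP-style hashing spreads flows across all equal-cost next hops, while a VRRP group forwards everything through its single master:

```python
import hashlib

# Toy model only: with ECMP-style flow hashing, traffic spreads across
# all equal-cost next hops; with VRRP, every inter-VLAN flow exits
# through the one active virtual router while the backups stay idle.

def ecmp_next_hop(flow: tuple, next_hops: list) -> str:
    # Hash the 4-tuple so packets of one flow stay on one path.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[digest[0] % len(next_hops)]

def vrrp_next_hop(flow: tuple, routers: list) -> str:
    return routers[0]   # only the VRRP master forwards for the virtual IP

routers = ["rtr-1", "rtr-2", "rtr-3", "rtr-4"]
flows = [("10.0.1.%d" % i, "10.0.2.9", 49152 + i, 80) for i in range(1000)]

ecmp_used = {ecmp_next_hop(f, routers) for f in flows}   # all four routers
vrrp_used = {vrrp_next_hop(f, routers) for f in flows}   # just the master
```

With a thousand flows, the ECMP set exercises every router while the VRRP set never grows past one, which is the Layer 3 bottleneck the text describes.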
4. FCoE on top of TRILL: FCoE on top of TRILL is architecturally and operationally challenging. Vendors have found significant scalability challenges with distributed Fibre Channel services (the whole reason for NPIV or FC gateways was to limit the number of Fibre Channel Forwarders (FCFs) in the network). VSAN-to-VLAN mapping and configuring VLANs for FCFs to discover one another are some of the operational challenges. All of these are being actively debated in T11 under FC-BB-6. The conclusion has been that only I/O convergence is possible with FC-BB-5.
5. Multi-tenancy: TRILL does not specify how overlapping name spaces must be handled, nor does it offer a solution for scaling beyond 4K VLANs.
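Back-of-the-envelope arithmetic on the 4K limit (the per-tenant VLAN count below is illustrative, not from any deployment):

```python
# The 802.1Q VLAN tag is 12 bits, so a fabric that gives each tenant
# even a handful of segments runs out of IDs long before a large cloud
# runs out of tenants.

USABLE_VLANS = 2 ** 12 - 2   # 4094 (IDs 0 and 4095 are reserved)

def max_tenants(vlans_per_tenant: int) -> int:
    return USABLE_VLANS // vlans_per_tenant

# e.g., with 4 VLANs per tenant (web/app/db/mgmt tiers):
tenants = max_tenants(4)     # about a thousand tenants at most
```

A cloud provider with tens of thousands of tenants therefore cannot rely on plain 802.1Q VLAN IDs for per-tenant segmentation, with or without TRILL.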
6. Congestion management: The address plane of the TRILL core is different from the address plane of the TRILL edge. Congestion detected in the core must be signaled back to the source of the congestion (the end host), but the TRILL core has no knowledge of end points; it only knows about TRILL edges. Therefore congestion management schemes like Quantized Congestion Notification (QCN, 802.1Qau) won't work effectively in this architecture.
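A toy model of the mismatch (no real TRILL or QCN frame formats; the names are illustrative): a congestion point reflects a notification toward the source address it sees in the congesting frame, and inside the TRILL core that address is the ingress RBridge, not the end host:

```python
# Toy model: TRILL encapsulation hides the original (inner) MAC
# addresses behind an outer hop-by-hop header, so a core congestion
# point addressing a notification to the frame's source reaches the
# edge RBridge instead of the host that must slow down.

def trill_encapsulate(frame: dict, ingress_rbridge: str, egress_rbridge: str) -> dict:
    return {"outer_src": ingress_rbridge, "outer_dst": egress_rbridge, "inner": frame}

def congestion_notify(congesting_frame: dict) -> str:
    # The congestion point only sees the outer header.
    return congesting_frame["outer_src"]

frame = {"src": "host-a", "dst": "host-b", "payload": b"data"}
on_wire = trill_encapsulate(frame, ingress_rbridge="rb-edge-1", egress_rbridge="rb-edge-9")
target = congestion_notify(on_wire)   # the edge RBridge, not "host-a"
```

The notification stops at the edge; without extra machinery to translate it back to the inner source address, the actual sender is never throttled.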
7. Interoperability: To achieve interoperability, a fabric will have to be standards-compliant in the data plane, control plane, services plane and management plane. Divergent vendor implementations pose a threat to interoperability.
In the final analysis, TRILL is an incremental approach that increases complexity while addressing only a fraction of the problems faced by modern data centers. Users need a solution that reduces complexity while addressing the full spectrum of data center issues. In other words, they need a flat, data center-wide fabric that delivers exponential improvements in speed, scale and efficiency, achieved by removing legacy barriers and improving business agility.
About the author: Anjan Venkatramani is Vice President of Product Management, Data Center Business Unit, Fabric and Switching Group at Juniper Networks. In this role, Anjan heads product management and strategy for Juniper’s QFabric cloud computing initiative.
Anjan holds numerous patents in network and data center fabrics, memory design and security services architectures. Prior to joining Juniper Networks, Anjan started his career at Siemens Nixdorf building supercomputers; work that produced several journal publications and patents in multi-processor interconnects. Anjan also worked for a start-up in the area of signal processing for DSL and Wireless applications.