We are in the midst of a global WAN upgrade that relies heavily on Service Provider Layer 3 VPN MPLS services, and have been experiencing performance, routing, and availability issues at certain sites. Reports of application timeouts, slowness, and complete outages (primarily in emerging markets) have become more prevalent as sites are migrated from the legacy WAN infrastructure, which was built on low-speed P2P circuits, Frame Relay, and an IPsec/Internet VPN fabric.
In many of these cases, the Service Providers have blamed faulty circuits or equipment as the source of the problems, but in other cases we have been told that the problem was either “transient” or that there was “no problem found.” During one of these instances, our operations staff was logged on to a CPE router and discovered that BGP routes were present from the MPLS PE router, yet all traffic, including pings, was being dropped. The operator was able to manually shut down the CPE WAN interface to the provider, allowing the site to “fail over” from the MPLS network to the legacy WAN network, which restored service. This is very concerning to our executives, particularly because we have justified the expense of keeping the legacy network in place as a backup. Many of these sites are unmanned, and certain batch applications run at night, so we cannot count on someone being there to intervene manually.
Are there any technologies or solutions that can automatically detect and react to these types of network routing problems? We have looked at BFD, but this appears to be a link-specific solution that is not supported by certain providers. Our operations staff is considering a GRE tunnel overlay, but this would greatly increase complexity and defeat many of the benefits that drove us to MPLS VPNs in the first place.
It sounds like you are encountering conditions often referred to as “brownouts” or “black holes,” where degradations in the MPLS core network are perceptible to end users but not to the CPE routers themselves. These problems are more common in Layer 3 VPN networks than in traditional P2P or overlay (Frame Relay, IPsec) WAN deployments because of the additional layer of routing hierarchy inserted by the service provider. Since no two CPE sites maintain direct routing adjacencies with each other, they must rely on control plane signaling initiated by the PE routers on the MPLS/VPN network to learn that a remote site has become unreachable. This can be much slower in improperly designed service provider networks.
You are correct that BFD (Bidirectional Forwarding Detection) will not solve your problem. As typically deployed on the link between the CPE and PE routers, it is limited to “liveness” detection between direct neighbors and has no mechanism to verify end-to-end path integrity. Deploying an overlay of GRE tunnels between your CPE routers is certainly a feasible way to detect problems. GRE keepalives or tuned routing protocols can be enabled over the tunnels to detect the degradations and force failovers. However, as you indicate, these designs increase complexity and, in some cases, impose performance penalties due to fragmentation and the inability to forward packets in hardware on certain platforms.
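To make the keepalive idea concrete, here is a minimal Python sketch of the kind of logic such a mechanism applies: a tunnel is declared down only after a configured number of consecutive keepalives go unanswered, and comes back up on the next reply. This is an illustration, not router code. The class name, the `retries` default, and the method names are all hypothetical, and real implementations differ in timers and recovery behavior.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveMonitor:
    """Tracks consecutive missed keepalives for one tunnel (illustrative only)."""

    retries: int = 3   # hypothetical: misses tolerated before declaring the tunnel down
    missed: int = 0    # consecutive keepalives that went unanswered
    up: bool = True    # current tunnel state as seen by this end

    def record(self, reply_received: bool) -> bool:
        """Feed one keepalive result and return the resulting tunnel state."""
        if reply_received:
            # Any reply resets the counter and restores the tunnel.
            self.missed = 0
            self.up = True
        else:
            self.missed += 1
            if self.missed >= self.retries:
                # Enough consecutive losses: treat the end-to-end path as failed.
                self.up = False
        return self.up
```

Because the keepalives travel end to end through the provider core, this kind of check can catch a black hole that a CPE-to-PE liveness check would miss. The trade-off is detection time, which is roughly the keepalive interval multiplied by the retry count.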
One solution that may be worth investigating is Performance Routing (PfR), formerly known as Optimized Edge Routing (OER). Performance Routing is an integrated Cisco IOS solution that enhances traditional routing by using the intelligence of embedded Cisco IOS features to improve application performance and availability. PfR can be configured to monitor IP traffic flows, measure WAN path performance, and dynamically re-route traffic when network conditions degrade or when user-defined policies dictate specific WAN exit points. PfR makes its routing decisions based on real-time feedback from IOS reporting sources such as NetFlow data records, IP SLA statistics, and WAN link utilization. This enables an application-aware routing capability not possible with traditional routing protocols such as OSPF or BGP, which are limited to one-dimensional metrics when choosing a “best path.”
Depending on the hardware and IOS levels that you are running on your CPE router(s), it may be possible to simply enable the PfR feature on them and define a performance policy that monitors end-to-end path availability. It may also be possible to re-route your traffic across the legacy network when feedback from NetFlow or IP SLA indicates that a brownout or blackout condition is occurring. The feature is also useful for reporting, since it collects path performance measurements as part of its normal operation.
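As a rough illustration of what such a performance policy does, the Python sketch below keeps a sliding window of end-to-end probe results for the MPLS path and prefers the legacy exit whenever loss or average delay goes out of policy. It is not PfR's actual algorithm or configuration. The class, the function, the exit labels, and the threshold values are hypothetical and chosen only to show the decision logic.

```python
from collections import deque
from statistics import mean
from typing import Deque, Optional


class PathPolicy:
    """Sliding-window view of probe results for one WAN path (illustrative only)."""

    def __init__(self, window: int = 10, max_loss: float = 0.2,
                 max_delay_ms: float = 400.0) -> None:
        # Each sample is a round-trip time in ms, or None for a lost probe.
        self.samples: Deque[Optional[float]] = deque(maxlen=window)
        self.max_loss = max_loss          # hypothetical loss threshold (fraction)
        self.max_delay_ms = max_delay_ms  # hypothetical delay threshold

    def add_probe(self, rtt_ms: Optional[float]) -> None:
        """Record one probe result; pass None when the probe got no reply."""
        self.samples.append(rtt_ms)

    def loss(self) -> float:
        """Fraction of probes in the window that were lost."""
        if not self.samples:
            return 0.0
        return sum(1 for s in self.samples if s is None) / len(self.samples)

    def avg_delay_ms(self) -> Optional[float]:
        """Mean round-trip time of answered probes, or None if there were none."""
        answered = [s for s in self.samples if s is not None]
        return mean(answered) if answered else None

    def in_policy(self) -> bool:
        """True while both loss and delay stay within their thresholds."""
        delay = self.avg_delay_ms()
        delay_ok = delay is None or delay <= self.max_delay_ms
        return self.loss() <= self.max_loss and delay_ok


def choose_exit(primary: PathPolicy) -> str:
    """Prefer the MPLS exit; fall back to the legacy WAN when it is out of policy."""
    return "mpls" if primary.in_policy() else "legacy"
```

The point of the window is that a black hole (all probes lost) and a brownout (probes answered, but slowly or intermittently) both push the path out of policy, even though the BGP routes from the PE router are still present. In practice you would also want some hold-down time before moving traffic back, to avoid flapping between exits.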
Take a look at the Cisco Performance Routing DocWiki (http://docwiki.cisco.com/wiki/PfR:Home) for a detailed overview of the theory, use cases, and configuration samples for PfR.
This was first published in November 2011