We are in the midst of a global WAN upgrade that relies heavily on Service Provider Layer 3 MPLS VPN services, and we have been experiencing performance, routing, and availability issues at certain sites. Reports of application timeouts, slowness, and complete outages (primarily at sites in emerging markets) have become more prevalent as sites are migrated from the legacy WAN infrastructure, which was built on low-speed P2P circuits, Frame Relay, and an IPsec/Internet VPN fabric.
In many of these cases, the Service Providers have blamed faulty circuits or equipment as the source of the problems, but in other cases we have been told that the problem was either “transient” or that there was “no problem found.” During one of these instances, a member of our operations staff was logged on to a CPE router and discovered that BGP routes learned from the MPLS PE router were still present, yet all traffic, including pings, was being dropped. The operator manually shut down the CPE WAN interface to the provider, which forced the site to “fail over” from the MPLS network to the legacy WAN network and restored service. This is very concerning to our executives, particularly because we have justified the expense of keeping the legacy network in place as a backup network. Many of these sites are unmanned and certain batch applications run at night, so there is often nobody available to perform this kind of manual failover.
Are there any technologies or solutions that can automatically detect and react to these types of network routing problems, where routes are still advertised but traffic is silently dropped? We have looked at BFD (Bidirectional Forwarding Detection), but this appears to be a link-specific solution, and it is not supported by certain providers. Our operations staff is considering a GRE tunnel overlay, but this would greatly increase complexity and defeat many of the benefits that drove us to MPLS VPNs in the first place.
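To make concrete what we mean by "detect and react", here is a rough sketch of the decision logic we would like to see automated. It is not tied to any vendor feature; the class name, thresholds, and the idea of feeding it the result of a periodic end-to-end probe across the MPLS path are all our own assumptions for illustration. The point is that the decision is driven by whether traffic actually gets through, not by whether BGP routes are present.

```python
from dataclasses import dataclass


@dataclass
class BlackholeDetector:
    """Decides when to move a site off the MPLS path and back again.

    Hypothetical illustration only: names and thresholds are assumptions,
    not part of any router or provider feature.
    """
    fail_threshold: int = 3      # consecutive lost probes before failing over
    recover_threshold: int = 5   # consecutive good probes before failing back
    failed_over: bool = False
    _lost: int = 0
    _good: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one end-to-end probe result; return True while failed over."""
        if probe_ok:
            self._good += 1
            self._lost = 0
        else:
            self._lost += 1
            self._good = 0

        if not self.failed_over and self._lost >= self.fail_threshold:
            # Routes may still be present, but traffic is not getting through.
            self.failed_over = True
        elif self.failed_over and self._good >= self.recover_threshold:
            # Require a longer run of successes before trusting the path again.
            self.failed_over = False
        return self.failed_over


if __name__ == "__main__":
    detector = BlackholeDetector()
    probes = [True, False, False, False, True, True, True, True, True]
    print([detector.observe(p) for p in probes])
```

The asymmetric thresholds are deliberate: we would rather fail over quickly and fail back slowly, so that a "transient" problem does not cause the site to flap between the MPLS and legacy networks.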
