Complex routing problems have been shown to cause more than half of all IP network downtime, and more than half of all enterprise routers run Cisco's Enhanced Interior Gateway Routing Protocol (EIGRP). Finding ways to troubleshoot EIGRP networks more effectively and to minimize routing problems is therefore a major concern for today's enterprises.
EIGRP offers fast convergence time and can scale to thousands of routers, spanning multiple autonomous systems. These attributes, along with Cisco's dominance in the router market, have made EIGRP the de facto standard in large enterprise networks.
However, one particularly frustrating troubleshooting issue that EIGRP network engineers confront is the "stuck in active" (SIA) condition, in which an EIGRP router fails to receive a reply to a routing-path query from one or more of its "neighbor" routers within an allotted time (typically three minutes, depending on the version of Cisco's IOS in use). An SIA condition can cause EIGRP routers to drop neighbor adjacencies, disrupting service. Worse, if the dropped adjacency causes other SIAs, the problem can cascade throughout the network, affecting large numbers of users and causing major network outages.
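The timer behavior described above can be modeled in a few lines. The following is a minimal sketch, not Cisco's implementation: the class name `ActiveRoute`, its fields, and the 180-second constant (taken from the "typically three minutes" figure) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed value: the article's "typically three minutes"; the real
# timer depends on the IOS version and configuration.
ACTIVE_TIMER_SECONDS = 180


@dataclass
class ActiveRoute:
    """Hypothetical model of one route that has gone active on a router."""
    prefix: str
    went_active_at: float  # seconds, on any monotonic clock
    pending_replies: set = field(default_factory=set)  # neighbors still queried

    def record_reply(self, neighbor: str) -> None:
        # A reply from a neighbor clears that outstanding query.
        self.pending_replies.discard(neighbor)

    def is_passive(self) -> bool:
        # The route can return to passive once every neighbor has replied.
        return not self.pending_replies

    def stuck_neighbors(self, now: float) -> list:
        # Once the active timer has run out, any neighbor that still has
        # not replied is the one the route is "stuck" on.
        if now - self.went_active_at >= ACTIVE_TIMER_SECONDS:
            return sorted(self.pending_replies)
        return []


route = ActiveRoute("10.1.0.0/16", went_active_at=0.0,
                    pending_replies={"R2", "R3"})
route.record_reply("R2")
```

In this model the route stays active as long as `R3` has not replied, and `stuck_neighbors` starts naming `R3` only after the timer has run out, which is the point at which a real router would reset the adjacency.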
To stave off a broad network reconvergence -- or, at minimum, a reset adjacency leading to costly downtime -- network administrators must locate the non-responding router before the active timer expires. This tedious manual process involves starting at the router where the SIA most recently occurred, proceeding to the nearest neighbor router, and following the problem, router by router, back to its source. However, this effort frequently comes too late: once the timer has expired and adjacencies are reset, it becomes virtually impossible to identify the router responsible for the original SIA condition, because the network's forensic "audit trail" has been erased. A senior network architect at a major pharmaceutical company likened the problem to "tracing footprints in the sand." In most cases, an administrator cannot notice an SIA event and react in time to find its source before the trail vanishes, so the problem is likely to recur. What's more, in a large network, an SIA condition that causes downtime can have serious bottom-line consequences.
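The router-by-router trace can be pictured as a walk along the chain of unanswered queries. The sketch below assumes the administrator has already collected, for each router, the neighbors it is still waiting on for the active route (on Cisco IOS this is typically read from `show ip eigrp topology active`). The function name `trace_sia_source` and the dictionary layout are hypothetical.

```python
def trace_sia_source(waiting_on, start):
    """Follow outstanding queries from `start` toward the non-responder.

    waiting_on maps a router name to the list of neighbors that have not
    yet replied to its query (an assumed, hand-collected data structure).
    Returns the routers visited, in order; the last one is the candidate
    source of the SIA condition.
    """
    path = [start]
    seen = {start}
    current = start
    while waiting_on.get(current):
        next_hop = waiting_on[current][0]  # follow the first unanswered query
        if next_hop in seen:
            break  # query loop: stop rather than walk in circles
        path.append(next_hop)
        seen.add(next_hop)
        current = next_hop
    return path


# Example: R1 reported the SIA, R2 is waiting on R4, R4 waits on nobody.
waiting = {"R1": ["R2"], "R2": ["R4"], "R4": []}
trace = trace_sia_source(waiting, "R1")
```

Here the walk ends at `R4`, the router that is not waiting on anyone yet has not answered. The sketch only works while the routes are still active; once the adjacencies reset, the `waiting_on` data no longer exists to be collected.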
Whether or not the SIA condition is caught and remedied before adjacencies are reset, it is critically important to discover why the router failed to respond to queries and why the original route(s) went active in the first place. Common SIA triggers are flapping links, overloaded routers, and failure to configure route summarization. Routing configuration errors are especially problematic in large networks, where they can rapidly proliferate conditions that precipitate further SIA events.
In his book IP Routing (O'Reilly, 2002), Ravi Malhotra says, "the best preparation for troubleshooting [an EIGRP] network is to be familiar with the network and its state during normal (trouble-free) conditions." He recommends that network engineers possess detailed knowledge of routing tables, summarization points and routing timers, plus extensive "what-if" scenario plans.
A new technology called route analytics gives EIGRP network administrators just this sort of knowledge for the first time. Route analytics solutions work by listening passively to all routing exchanges on the network and delivering a "router's eye view" of Layer 3 activity -- a complete and accurate network-wide topology map showing all EIGRP routes and events, both real-time and historical. A complete prefix advertisement history is maintained for the network, providing an audit trail that includes prefix type, AS of origin, metrics and more. These routing events are then resolved into the link state changes that caused the EIGRP updates. Network engineers can now proactively monitor for faulty routing behavior caused by configuration errors or other problems. They can, for example, verify that all remote access routers are configured with route summarization to prevent rapidly changing route advertisements from precipitating SIA conditions. Misconfigured routing redundancy can be caught before it causes routers to go active on important routes. And any detected active route is automatically watched; if it doesn't return to passive state soon, possible active/SIA query paths are probed to determine whether an SIA condition exists and, if so, where it originated.
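As one illustration of the summarization check mentioned above, the sketch below flags prefixes that a remote router advertises individually even though they fall inside the summary it is expected to announce. It is not taken from any route analytics product; the function name `unsummarized_prefixes` and the sample addresses are assumptions, and only Python's standard `ipaddress` module is used.

```python
import ipaddress


def unsummarized_prefixes(advertised, expected_summary):
    """Return advertised prefixes that are more specific than the summary.

    advertised: prefix strings seen from one remote router (assumed input).
    expected_summary: the single summary that router should be announcing.
    """
    summary = ipaddress.ip_network(expected_summary)
    leaks = []
    for text in advertised:
        net = ipaddress.ip_network(text)
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version != summary.version:
            continue
        if net != summary and net.subnet_of(summary):
            leaks.append(text)
    return leaks


seen_from_branch = ["10.20.0.0/16", "10.20.1.0/24",
                    "10.20.2.0/24", "192.168.5.0/24"]
leaks = unsummarized_prefixes(seen_from_branch, "10.20.0.0/16")
```

A non-empty result suggests that summarization is missing or incomplete on that router, so changes to the individual /24s could still propagate queries into the rest of the network.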
Perhaps most important for SIA scenarios, route analytics can restore the missing "audit trail" normally lost after reconvergence, so that an SIA can be traced even if it has temporarily stopped occurring. Route analytics tools can record all EIGRP routing events to a database and create an event log that administrators can use like a VCR, rewinding and playing back to show the end-to-end EIGRP network at any time, stepping through past routing events to determine the root cause of an SIA condition. No more "footprints in the sand."
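The "VCR" idea amounts to replaying a time-ordered event log up to a chosen moment. The following is a simplified sketch under assumed conventions: each event is a hypothetical `(timestamp, router, prefix, state)` tuple, and `state_at` is an illustrative helper, not the interface of any particular tool.

```python
def state_at(events, when):
    """Rebuild per-route state as of time `when` from a recorded log.

    events: iterable of (timestamp, router, prefix, state) tuples, where
    state is "active" or "passive" (an assumed, simplified record format).
    Returns a dict keyed by (router, prefix) holding the latest state seen
    at or before `when`.
    """
    state = {}
    for timestamp, router, prefix, new_state in sorted(events):
        if timestamp > when:
            break  # everything after this point is "in the future"
        state[(router, prefix)] = new_state
    return state


log = [
    (0, "R1", "10.1.0.0/16", "passive"),
    (10, "R1", "10.1.0.0/16", "active"),   # route goes active
    (200, "R1", "10.1.0.0/16", "passive"),  # adjacency reset, route settles
]
snapshot = state_at(log, 50)
```

Rewinding to time 50 shows the route still active on `R1`, even though the live network at time 200 and later shows nothing unusual. Stepping `when` forward event by event gives the playback behavior described above.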
Furthermore, what-if analysis tools within route analytics let the administrator analyze the potential impact of various events, such as configuration changes, that might trigger SIAs. This is simulated on the actual, as-running network, rather than on an outdated and possibly inaccurate model.
With new route analytics tools, enterprise network operators can greatly improve their chances of capturing the information they need to troubleshoot SIA conditions effectively. They can keep many SIAs from happening, and can detect and diagnose those that do occur more efficiently than ever before. The result is less costly network downtime and more resources free to focus on proactively improving service availability.
About the author:
Alex Henthorn-Iwane is senior director of marketing at Packet Design Inc. Before joining Packet Design in 2004, he spent two years as senior director of product management and marketing at CoSine Communications Inc., a provider of network-based IP services platforms enabling rapid delivery of communication services by service providers. Before that he was senior director of product and program management at Corona Networks. He has also held product management positions with Lucent Technologies and Livingston Enterprises (acquired by Lucent) and systems engineering management posts with Fibronics America.