Unpredictability replaces downtime as enemy No. 1

Catastrophic network downtime has become less common as improvements in protocols and software have increased IP network stability. Despite this, networks remain unpredictable.

Alex Henthorn-Iwane

Catastrophic network downtime, once a primary problem facing enterprise network managers, has become less common as improvements in protocol and software implementations have increased IP network stability. Yet, despite the higher rates of uptime, IP networks remain unpredictable. They are plagued with brownouts, intermittent disruptions and performance degradations that exact a high productivity cost on users and applications -- especially sensitive applications such as voice and video over IP.

Unpredictable network behavior occurs even when traditional network management systems report that every device in the network is working properly. Compounding the problem, most such behavior leaves no audit trail. Network engineers often can't even explain what happened, let alone why.

In this regard, the genius of IP's distributed intelligence also turns out to be one of its greatest challenges. Routers exchange reachability information using various routing protocols (OSPF, EIGRP, BGP), then make their own decisions about how to forward packets to their destinations. Should any link or node fail, the routers automatically redirect traffic to alternate paths, skirting the failed element. This distributed intelligence creates a dynamic topology that controls how the network behaves and how traffic flows.
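
The rerouting behavior described above can be illustrated with a minimal sketch. It assumes a toy four-router topology (the router names A through D and the link costs are hypothetical) and uses a plain shortest-path calculation as a stand-in for what a link-state protocol computes; it is not a model of any particular protocol implementation.

```python
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over an undirected weighted graph.

    links maps (router_a, router_b) -> cost. Returns (cost, path),
    or None when dst is unreachable from src.
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost

    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical topology: two paths from A to D with different costs.
links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}
before = shortest_path(links, "A", "D")

# Simulate a failure of the B-D link and recompute.
links_after_failure = {k: v for k, v in links.items() if k != ("B", "D")}
after = shortest_path(links_after_failure, "A", "D")

print("before failure:", before)
print("after failure: ", after)
```

Traffic still reaches D after the failure, but over a longer path. Nothing is "down" from a device-status point of view, which is exactly why this kind of change is easy to miss.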

This efficient mechanism makes IP networks highly resilient while letting them scale economically. But the complex interactions between devices in this dynamic topology also give rise to network-layer problems when there is an accumulation of misconfigurations, network changes and/or software bugs. These inter-device problems are logical in nature and tend to have network-wide effects -- often latent effects that exhibit symptoms only intermittently or under certain adverse circumstances.

Conventional network management systems don't detect or help diagnose dynamic, network-layer problems because, rather than analyzing the systemic operation of the network, they focus on individual physical devices and their status, such as CPU utilization, memory allocation, and interface up/down status. This ignores the far greater number of logical network-layer elements -- such as network prefixes, routing protocols, routing events, router-to-router links and the routes themselves -- associated with the complex interactions between physical devices.

Traditional SNMP-based device management systems are unable to manage the dynamic topology and its logical elements for three reasons: network-layer events happen too rapidly to be caught by standard polling cycles; the systems lack the network-wide topology awareness needed to correlate dynamic network-layer problems; and the sampling techniques used by SNMP management can't cope with huge volumes of events -- sometimes over a million stemming from a single root cause.
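
The polling gap can be shown with a small simulation. The timestamps, the 300-second polling interval and the link-state events below are all assumed values chosen for illustration; the point is only that a state sampled at fixed intervals can miss every transition that happens between samples.

```python
def state_at(events, t, initial="up"):
    """Return the link state at time t, given (timestamp, new_state)
    events sorted by timestamp."""
    state = initial
    for timestamp, new_state in events:
        if timestamp > t:
            break
        state = new_state
    return state

# Hypothetical link flaps, in seconds since the start of observation.
events = [(100, "down"), (130, "up"), (410, "down"), (425, "up")]

# A poller that samples the link state every 300 seconds (assumed interval).
poll_interval = 300
polls = [(t, state_at(events, t)) for t in range(0, 901, poll_interval)]

for t, state in polls:
    print(f"poll at t={t:>3}s -> link is {state}")

transitions_seen = sum(
    1 for (_, prev), (_, curr) in zip(polls, polls[1:]) if prev != curr
)
print(f"transitions that occurred: {len(events)}")
print(f"transitions seen by polling: {transitions_seen}")
```

Every poll reports the link as up, even though it went down twice. Shortening the interval narrows the gap but does not close it, and it raises the polling load on every device.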

In the worst case, logical errors can lead to route loops or black holes that result in network service outages. Short of that, performance can degrade through inadvertent congestion when application traffic follows errant routes masked by IP's self-healing capabilities. Redundancy can be compromised when router misconfigurations make critical links unavailable just when they are needed most. Security can be breached when legacy equipment and routes long thought decommissioned are actually still active, or when extranet routing is misconfigured to allow the injection of another organization's routes into an enterprise network.
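
To make the two failure modes concrete, here is a minimal sketch that follows hop-by-hop forwarding decisions toward a single destination. The `next_hops` tables and router names (R1 through R4) are hypothetical; a real check would have to read the forwarding state for each prefix from the routers or the routing protocols.

```python
def trace(next_hops, src, dst):
    """Follow each router's next hop toward dst.

    next_hops maps router -> next-hop router for the destination.
    Returns (outcome, path), where outcome is "delivered", "loop"
    or "black hole".
    """
    path = [src]
    seen = {src}
    node = src
    while node != dst:
        nxt = next_hops.get(node)
        if nxt is None:
            # This router has no route: packets are silently dropped here.
            return "black hole", path
        path.append(nxt)
        if nxt in seen:
            # The packet has returned to a router it already visited.
            return "loop", path
        seen.add(nxt)
        node = nxt
    return "delivered", path

# Every router is up in all three cases; only the routing state differs.
healthy = {"R1": "R2", "R2": "R4"}
looping = {"R1": "R2", "R2": "R3", "R3": "R2"}
missing = {"R1": "R2"}

print(trace(healthy, "R1", "R4"))
print(trace(looping, "R1", "R4"))
print(trace(missing, "R1", "R4"))
```

Each router in the looping and black-hole cases would pass a device health check, because the fault lies in the relationship between their routing tables rather than in any one device.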

Typically, network engineers don't even notice emerging network-layer problems until users call to report disruptions or slowdowns. Lacking any audit-trail history, the engineers must manually connect to and search routers for information. Lacking automated network-layer correlation and analysis tools, they must painstakingly assemble data to track down root causes -- a task that, with some extremely verbose protocols, may not even be humanly possible.

As a result, only the most severe problems are investigated. The rest are left unexplained or cease to exhibit symptoms before analysis can take place, often contributing to future messes that network engineers must untangle. The upshot is usually a combination of longer network mean time to repair (MTTR), failure to meet service-level agreements, inability to prevent the problems from recurring, and loss of productivity, especially among the most skilled (and expensive) staff.

An emerging technology known as route analytics taps into the information in the routing protocols themselves to understand how the network's dynamic topology is operating -- logically -- at any given moment. Route analytics solutions in many ways resemble routers. Like routers, they can actually listen to and participate in the routing protocol exchanges between routers. Unlike routers, though, they forward no traffic and thus add no network overhead. A route analytics solution can compute a real-time, network-wide routing map; monitor and display routing topology changes as they happen; detect and alert on routing events or failures as routers announce them; correlate routing events with other information to reveal underlying cause-and-effect relationships; report on historical routing events and trends; and assess the impact of possible routing changes on the network, even before they happen.
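
The idea of building a map and a history from the routing announcements themselves can be sketched in a few lines. This is a toy model, not how any route analytics product works: the event tuples, the "announce" and "withdraw" action names and the router names are assumptions standing in for real routing protocol messages.

```python
def replay(events):
    """Apply link announcements and withdrawals in order.

    events is a list of (timestamp, action, router_a, router_b).
    Returns (topology, history): the set of links currently up and a
    timestamped audit trail of every change.
    """
    topology = set()
    history = []
    for timestamp, action, router_a, router_b in events:
        link = tuple(sorted((router_a, router_b)))
        if action == "announce" and link not in topology:
            topology.add(link)
            history.append((timestamp, "link up", link))
        elif action == "withdraw" and link in topology:
            topology.remove(link)
            history.append((timestamp, "link down", link))
    return topology, history

# Hypothetical event stream: the R2-R3 adjacency flaps for half a second.
events = [
    (0.0, "announce", "R1", "R2"),
    (0.0, "announce", "R2", "R3"),
    (12.4, "withdraw", "R3", "R2"),
    (12.9, "announce", "R2", "R3"),
]

topology, history = replay(events)
print("current topology:", sorted(topology))
for entry in history:
    print(entry)
```

The final topology is identical to the starting one, so a before-and-after comparison would show nothing. The history is what preserves the half-second flap, which is the audit trail that device polling lacks.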

For the first time, network engineers can see the dynamic, system-wide operation of the network -- a "router's eye" view -- rather than just a conglomeration of individual device views. Loss of network-layer connectivity is detected even when device-level status is unchanged. Routing changes that affect network availability and performance, yet go unnoticed by conventional management systems, are visible in seconds.

With route analytics technology, network engineers can reduce MTTR, more easily explain troubling network phenomena, and give users back their lost productivity.

About the author:
Alex Henthorn-Iwane is senior director of marketing at Packet Design Inc. Before joining Packet Design in 2004, he spent two years as senior director of product management and marketing at CoSine Communications Inc., a provider of network-based IP services platforms enabling rapid delivery of communication services by service providers. Before that he was senior director of product and program management at Corona Networks. He has also held product management positions with Lucent Technologies and Livingston Enterprises (acquired by Lucent) and systems engineering management posts with Fibronics America.