Troubleshooting network issues in an era of reduced visibility

As networks become more complex, fundamental monitoring tools are losing their currency. But it's still possible to retain visibility and keep users happy.

The first step in fixing anything is admitting you have a problem.

Denial is a significant factor in IT tools addiction and we too often ignore creeping self-deception. It's one thing when a new IT technology emerges that requires unique tools, but it's something different and deeply upsetting when essential and unsung tools we've relied on for decades begin to fail us. And as we bet our businesses on cloud, software as a service (SaaS) and hybrid IT, venerable tools as basic as traceroute are becoming obsolete. How can admins begin troubleshooting network issues in an increasingly opaque environment with decreasing administrative authority?

A rush to opacity

One of the most seductive features of cloud and SaaS is historically anathema to engineers but thrills IT management to no end: "There's less to manage." There's nothing inherently wrong with transitioning critical services to an approach of service-level agreement confidimus -- In SLAs We Trust. The promise is well intended and service providers generally do everything they can to delight their customers -- for the most part.

The sticky bit, however, is that network engineers are still responsible for ensuring a great user experience. This transition comes even as they agree to reverse decades of progress toward rich monitoring. IT is agreeing to put systems that are at the heart of a business' success where they have little or no access to troubleshoot network issues, and limited options for tracking and reporting overall performance. And, as you've probably discovered, while Amazon, Google, Salesforce and Azure are good and getting better, they certainly don't have zero-failure, unlimited infrastructure. They're subject to the same physical laws as our data centers, and help desk tickets are still being opened.

APIs supersede SNMP

For a number of very good reasons, cloud providers aren't about to open their firewalls or allow us to monitor their software-defined infrastructures. Instead, we're forced to rely on them to provide management APIs and proprietary tools that allow us some degree of oversight in our quest to troubleshoot network issues. These interfaces aren't nearly as information-rich as we're accustomed to in our own data centers; they're not easy to use; and none offers the platform agnosticism and ubiquity of ICMP, SNMP and other protocols. But what they do leave wide open are specific paths for application traffic.

Even in our internal networks, traceroute and ping are hobbled by route multiplicity, which limits their ability to troubleshoot network issues between users and servers. Traceroute assumes the path between an observer and a service is linear, returning an approximate routing path for that one test. With hybrid IT networks, internet routing hugely multiplies the problem with interconnected multi-homing and adds impedances for UDP or ICMP traffic. How, then, do you isolate the cause of degraded Salesforce performance when the issue might be huge latency in one of four links carrying 25% of your app traffic?
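The four-link question above can be put into rough numbers. The sketch below is illustrative only: the latency values are invented, and it assumes each single-path test lands on one of the four links independently and with equal probability, which real load-balancing hashes do not guarantee.

```python
from statistics import mean, median

# Hypothetical round-trip latencies in milliseconds for four equal-share links.
# The values are made up for illustration; the last link is the degraded one.
link_latency_ms = [20.0, 20.0, 20.0, 400.0]
share_per_link = 1 / len(link_latency_ms)  # each link carries 25% of app traffic


def chance_all_probes_miss(bad_share, probes):
    """Probability that every one of `probes` single-path tests avoids the bad link.

    Assumes each test picks a link independently (a simplification).
    """
    return (1 - bad_share) ** probes


mean_latency = mean(link_latency_ms)      # what users feel on average
median_latency = median(link_latency_ms)  # what a "typical" test reports
miss_one = chance_all_probes_miss(share_per_link, 1)
miss_three = chance_all_probes_miss(share_per_link, 3)

print(f"mean {mean_latency} ms, median {median_latency} ms")
print(f"one test misses the bad link {miss_one:.0%} of the time")
print(f"three tests all miss it {miss_three:.1%} of the time")
```

Under these assumptions, a single test reports a healthy path three times out of four, even though one in four user requests crosses the slow link.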

Firewall whisperer

The answer is to stop thinking of our carefully designed internal networks and think instead like the internet. With internal networks, we design out as much uncertainty as possible; meanwhile, the internet relies on controlled route uncertainty for robustness. If instead you think like an application-specific packet, the entire traffic path from user to cloud server can be observed for multiple possible routes in multiple dimensions, including time. This technique to troubleshoot network issues isn't as immediately gratifying as a traceroute -- it takes some time to probe and spider -- but the results are both comprehensive and visual.

While poll-based monitoring of on-premises gear will continue to return operations-critical information for years to come, visual path monitoring helps us regain much of the authority we lose in the move to hybrid IT networks. It not only simplifies root cause detection for malfunctions or misconfigurations in our internal networks, but also extends troubleshooting through the internet and into our service providers' networks.

This works because modern network path monitoring tools -- among them, SolarWinds' recently released NPM 12 -- simulate application-specific traffic, which passes through firewalls exactly the same way as user traffic. They solve the problem of protocol- or port-specific routing through load balancers by encountering the same asymmetric multi-homed link latency issues, and they uncover all the hops that can interfere with service performance. Instead of reacting to a red icon on a router CPU, we can react to a red hop, wherever it may be. And when that's inside a cloud or SaaS provider's network, we can call their help desk with the information they need to resolve the issue and not spend all day on hold while they try to figure it out.
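The core idea of probing with application-shaped traffic can be sketched in a few lines of Python: time a TCP handshake to the same port users reach, instead of sending ICMP or UDP. This is a rough illustration, not how NPM 12 or any other product works. The function names `tcp_connect_ms` and `probe_service` are hypothetical, and the sketch measures only end-to-end connect time, not individual hops.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Time one TCP handshake to host:port, in milliseconds.

    If `host` is a name rather than an IP address, DNS lookup time is included.
    """
    start = time.perf_counter()
    # create_connection performs the TCP handshake; the socket is closed
    # as soon as the with-block exits.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def probe_service(host, port, samples=5, timeout=3.0):
    """Collect several handshake timings; a failed attempt is recorded as None."""
    results = []
    for _ in range(samples):
        try:
            results.append(tcp_connect_ms(host, port, timeout))
        except OSError:  # covers refused connections, timeouts and DNS failures
            results.append(None)
    return results
```

Calling `probe_service` against a service's hostname on port 443 at regular intervals would give a time series of handshake latencies over the same port the application uses. Because load balancers may send each new connection down a different path, repeated samples are more likely to expose an intermittently slow link than a single test.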

If we can regain hybrid IT visibility and keep users happy, maybe "less to manage" isn't such a bad thing after all when it comes to troubleshooting network issues.

This was last published in June 2016