Network troubleshooting can be challenging for a number of reasons, not the least of which is the lack of a standard methodology. In this tip, we'll look at troubleshooting methods from a couple of different angles.
Typically, when someone mentions a methodology, we think of something like the scientific method, which we might alter a bit for our purposes. Thus, we might go through some distinct phases in our troubleshooting where we would first prepare by understanding the normal, steady-state operation. Then, when the trouble occurs, we would define the problem, based on symptoms (e.g. "the network is slow" or "I cannot connect to the VAX"). Next, we'd identify the current state of the network, performing steps such as checking to see if the WAN circuits are up or collecting device logs as appropriate. Finally, we'd form a hypothesis and test it.
While a formal methodology does provide some semblance of scientific rigor for an otherwise artsy process, and it does increase the odds of success, it also has some drawbacks. First, it's slow. It takes time to work through the initial steps, which necessarily cover a lot more ground than is relevant to the problem, since we don't yet know what the problem is. Second, it doesn't take into account the natural process of learning, e.g. "It took me two hours to figure out why the network was slow the first time Bob in Accounting ran his application, but now it's the first thing I check when users call."
As you gain experience troubleshooting networks in general, and your current network in particular, you'll find this formal process a little tedious. So, my tip to help you troubleshoot faster is to understand the benefits of several methods and combine the best of each.
When you become aware of a problem, make a conscious effort to first gauge the severity and complexity of the issue. Ask yourself: "Based on the symptoms and a minute or two of investigation, is this something I've seen before? Can I fix this quickly, or would I benefit from the structure of a formal methodology?" If you go for the quick fix but the issue remains elusive, revisit this question periodically.
Next, as you work a problem, I'd suggest not starting from the top or bottom of any list and proceeding in order. Rather, do the fastest items first. For instance, starting in the middle of the OSI model with a ping is fast. If it succeeds, you immediately know there's nothing wrong at Layers 1 through 3 along that path; if it fails, no amount of diddling at the application layer will result in connectivity. Another fast start is checking a network management console. What's red? What's green? Ideally, you have tools in place that offer a quick, dashboard-style view of your network.
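To make the "fastest items first" idea concrete, here is a minimal Python sketch. It is an illustration only: the `Check` class, the `run_fastest_first` function, the time estimates, and the stand-in lambdas are all assumptions made for the example, not part of any real tool. In practice each `run` callable would wrap a real ping, a dashboard query, and so on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    """One troubleshooting step (hypothetical structure for this sketch)."""
    name: str
    est_seconds: float        # rough guess at how long the step takes
    run: Callable[[], bool]   # returns True if the step finds nothing wrong


def run_fastest_first(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check in ascending order of estimated time.

    Returns (name, passed) pairs in the order the checks were run.
    """
    results = []
    for check in sorted(checks, key=lambda c: c.est_seconds):
        results.append((check.name, check.run()))
    return results


if __name__ == "__main__":
    # Stand-in checks with made-up durations and hard-coded outcomes.
    checks = [
        Check("collect device logs", 600, lambda: True),
        Check("ping the destination", 5, lambda: False),
        Check("glance at the management console", 20, lambda: True),
    ]
    for name, passed in run_fastest_first(checks):
        print(f"{name}: {'ok' if passed else 'PROBLEM'}")
```

The list is written in whatever order the checks come to mind; sorting by estimated time is what puts the ping ahead of the log collection.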
As an example, here's what I'd check for a routing problem where the symptom is loss of connectivity. I'd start with a ping to confirm that it's not working, followed quickly by a traceroute to give me a general idea of where the problem might be. Then I'd log into the last router that responded to the traceroute and check its routing table to see whether it has an entry for the destination and whether the next hop points in the right direction.
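The routing-table check in that last step can be sketched in Python. This is a simplified illustration, assuming the table has already been copied off the router into a list of (prefix, next hop) pairs; the `best_route` function and the sample addresses are invented for the example. It applies longest-prefix matching, which is how IP routers generally choose among overlapping entries.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[str, str]  # (prefix in CIDR notation, next-hop address)


def best_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Return the longest-prefix match for destination, or None if no entry covers it."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (prefix, hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no entry at all: the router has nowhere to send the packet
    # The most specific prefix (largest prefix length) wins.
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


if __name__ == "__main__":
    # Made-up table for illustration only.
    routes = [
        ("0.0.0.0/0", "192.0.2.1"),
        ("10.0.0.0/8", "192.0.2.2"),
        ("10.20.0.0/16", "192.0.2.3"),
    ]
    print(best_route(routes, "10.20.30.40"))       # ('10.20.0.0/16', '192.0.2.3')
    print(best_route(routes[1:], "198.51.100.7"))  # None
```

A result of `None` corresponds to the "no entry for the destination" case. A match still leaves the second question from the text open: whether that next hop actually points in the right direction is something only knowledge of the topology can answer.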
The point is that each of these steps takes longer than the previous one. Do what you can do quickly first -- then, as your efforts get more involved, start applying a miniature version of the scientific method at each step. And throughout the process, keep notes. Make them just a little more detailed than you think is necessary.
Tom Lancaster, CCIE# 8829 CNX# 1105, is a consultant with 15 years of experience in the networking industry and co-author of several books on networking, most recently CCSP™: Secure PIX and Secure VPN Study Guide, published by Sybex.