This text is excerpted from the eBook Tips and Tricks Guide to Network Configuration Management, Chapter 3: Network Management Troubleshooting
The book from which this chapter is excerpted presents tips and tricks for six network configuration management topics. For ease of use, the questions and their solutions are divided into sections by topic, and each question is numbered according to its topic: Topic 1: Change Management Best Practices, Topic 2: Network Management Security, Topic 3: Network Management Troubleshooting, Topic 4: Change Management Techniques, Topic 5: Selecting and Deploying a Network Device Management Solution, and Topic 6: Enterprise Network Device Management.
To download/read the eBook in its entirety, visit: http://www.alterpoint.com/ebook
Topic 3: Network Management Troubleshooting
Q 3.1: What is the first step toward fixing a router that isn’t working?
>> One change at a time, please! The idea of using a known-good backup to recover from a device failure only works if you tend to make a small number of changes at a time, let them settle to ensure that they’re working properly, then immediately make a backup. If you’re in the habit of making a raft of changes at once, you’ll have a much more difficult time tracking down the change that caused the problem.
>> There’s no such thing as a minor change! Every single change to your network devices should go through your change management process. No change is too minor. We’ve all heard the story about the technician who blew dust out of a router’s cooling fan. He blew hard enough to stop the fan, causing the router to overheat and restart itself at seemingly random intervals. Had that simple maintenance action—cleaning out the router—been logged as a change, a senior administrator might have guessed that the problem was in the cooling fan, and checked that out first for a speedier resolution to the problem.
Q 3.2: How can change management contribute to improved network performance?
A: Managing large networks is a complex, difficult task. Suppose you took a job at a large corporation with tens of thousands of users spread across dozens of offices. Your job, you're told, is to find out why network performance is slow. Where do you start?
You could whip out your network analysis tools and start analyzing bandwidth utilization, broadcast traffic, router load, switch bandwidth, firewall utilization, and so forth, but doing so would require tons of time and might never point to a real performance bottleneck. If you do find a bottleneck, all you could really do is start shooting in the dark, making device configuration changes in an attempt to fix the bottleneck. More often than not, that practice simply reveals additional bottlenecks, creating an unending process of network configuration changes that never really improve performance. If you're after actual results, your best starting place is gathering some basic performance trend information and analyzing the network's change-management log.
If you can pin down a rough point in time when performance started to become less than optimal, you can start analyzing the changes that were made to the network's infrastructure devices around that time. You might discover, for example, a switch to a less-efficient routing protocol, or you might find that the routers connecting the various offices are providing packet filtering services. You might discover incorrectly configured multicast boundaries that are resulting in excess WAN traffic. Regardless, the configuration history can point to potential problems that contribute to the network's current condition. Discovering those problems empirically could take weeks or more, but finding them in the configuration history can be much, much easier.
The fact is that modern networks are becoming too large and too complex to manage as a single unit. Instead, you have to manage them in bits and pieces, and you have to manage them in small chunks of time. For example, suppose your company is getting ready to make a whole series of network device reconfigurations designed to improve performance or simply to increase network addressing capacity. Before making the changes, you can take a complete set of performance measurements. By taking another set of measurements after the changes are complete, you can determine the performance impact of the changes and relate that impact to specific configuration changes from the configuration history. You're not attempting to manage the network's overall performance. Instead, you're simply trying to manage the performance delta, or difference between the two configurations. Some administrators refer to this process as managing in increments, and it's an effective way to keep on top of large, complex networks.
Of course, managing in increments is only possible if you have a solid change-management process in place. The change-management process provides some important capabilities:
- Change management provides a logical checkpoint, allowing you the opportunity to take performance measurements before and after a discrete set of changes
- Change management provides a history, enabling you to compare before and after configurations and relate them to measured performance changes
- Change management provides a rollback mechanism, making it easier to revert to a previous configuration if the performance of a new configuration isn’t what you desired.
Ideally, you'll have access to software that can help gather and maintain device configuration information for historical and analytical purposes. That software might even allow you to store performance measurements so that you can save a performance baseline with each set of changes, defining a point in time at which that performance was measured and relating it to the device configuration that resulted in the performance.
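To make the idea of a performance delta concrete, here is a minimal Python sketch of storing a baseline with each set of changes and comparing two of them. It is only an illustration: the metric names, the numbers, and the change-set labels are made up, and a real tool would collect the measurements from the devices themselves.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Performance measurements tied to one saved device configuration."""
    change_set: str   # hypothetical label for the configuration in effect
    metrics: dict     # metric name -> measured value


def performance_delta(before: Baseline, after: Baseline) -> dict:
    """Return after-minus-before for every metric present in both baselines."""
    shared = before.metrics.keys() & after.metrics.keys()
    return {name: after.metrics[name] - before.metrics[name] for name in sorted(shared)}


# Illustrative numbers only, not real measurements.
before = Baseline("before-readdressing", {"wan_utilization_pct": 62, "latency_ms": 40})
after = Baseline("after-readdressing", {"wan_utilization_pct": 71, "latency_ms": 55})

for name, change in performance_delta(before, after).items():
    print(f"{name}: {change:+}")
```

Because each baseline carries the label of the configuration that produced it, a change in a metric can be traced back to one specific set of configuration changes rather than to the network as a whole.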
Q 3.3: What are some industry best practices for troubleshooting network devices?
A: Network devices have been around a long time, and the technology industry has developed several best practices that make troubleshooting easier and often let you avoid the need to troubleshoot altogether. As author Scott M. Ballew states in his book Managing IP Networks with Cisco Routers (O’Reilly and Associates), “The best way to handle network problems is to avoid them.”
Here are some additional tips I've picked up over the years:
- Create detailed documentation of your network’s physical connections. One of the most common reasons for network downtime is swapped cables, and a detailed map of which wires go where can be a huge benefit during troubleshooting. Given the alternative (tugging on wires until you figure out where they go), making documentation is a great investment of time
- As I’ve described in other tips, document every change you make to network devices’ configurations, and have backup configurations ready in case a change backfires
- Your first troubleshooting step should often be to simply undo whatever it was you did last. Backup configuration files can make doing so very easy and will let you review the problem-causing changes at your leisure
- Make as few changes as possible at a time; that way, if problems occur, you’ll have fewer changes to sort through to find the cause. How long you wait between changes is a matter of personal taste; I like to wait at least 1 week so that my network can experience the full range of a week’s workload before I certify the change as a success. Of course, in a busy network environment that uses the latest technologies, limiting your workload can be difficult or impossible, making third-party change-management tools all the more valuable.
Experienced administrators have learned these tips through trial and error. You likely have a few other common practices you follow in your environment to keep things running smoothly.
Q 3.4: How can I determine whether a new product or a consultant makes changes to our network devices?
A: Large companies are likely to have any number of consultants and contractors running around on different projects at any given time. Some of them might have the authority to make changes to your network devices, probably with the understanding that they document any changes they make. However, there’s always a change or two that gets made right before the weekend that doesn’t make it into the documentation.
In addition, it’s possible for new software applications to make changes to your network devices. Suppose you’re evaluating a new network performance monitoring solution that needs to query information from your routers. Or perhaps you’re installing an enterprise management solution that needs credentials to access your managed network devices. In these cases, the software might make minor configuration changes to your devices without your knowledge. That’s not necessarily a bad thing; the changes made by these software packages are usually minor and simply make it easier for the software to do its job. But you still need to know about those changes in order to control your device change management process. So what can you do?
Unfortunately, very few network devices are designed to automatically notify an administrator when their configurations are changed. After all, only an administrator should have the credentials to make a change, so the devices quite reasonably assume that the administrator made any changes and doesn’t need to be notified.
Manually Detecting Changes
Most higher-end network devices allow you to use Trivial File Transfer Protocol (TFTP) to transfer the devices’ configuration files to a TFTP server (I explained how to set up a TFTP server in tip 4.2). If you regularly dump your devices’ configurations to TFTP and save the files, you have a baseline from which to check for changes to the devices’ configuration. For example, suppose you downloaded a router’s configuration into a file you named Router5Feb03.txt. A contractor recently finished installing a new enterprise management solution, and you want to see if any changes were made to Router5. Just follow these steps:
1. Use Telnet to connect to the router that you want to back up (for this example, I’ll assume you’re using a Cisco router; change the following commands as necessary if you’re using a different device). Obviously, you could also use the router’s IP address instead of a name.
2. Log on to the router.
3. Enter the command that switches to privileged mode (enable on a Cisco router) and provide the correct password. Doing so enters privileged mode and lets you access the router’s configuration.
4. Enter the command that writes the configuration to a TFTP server, then enter the IP address of your TFTP server.
5. Enter the name of the configuration file (I’ll use Router5Mar03.txt for this example).
6. Press Enter to confirm the write. Ensure that the router responds with an [OK] prompt after writing the configuration.
7. Log out of the router.
Now you’ve got two text files, one with the old configuration and one with the new configuration. You simply need to compare the two. Assuming you’re running on a UNIX computer, enter the following:
diff -abls Router5Feb03.txt Router5Mar03.txt
If you’re using Windows, you can use a graphical version of Diff, called CSDiff, which I mentioned first in tip 4.2. It’s available from Component Software and makes it much easier to spot changes between versions of a text file. Best of all, it’s a free tool. Figure 3.1 shows how CSDiff highlights the differences between two text files.
Figure 3.1: Using CSDiff to analyze the differences in a router configuration file
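If neither Diff nor CSDiff is at hand, the same comparison can be sketched with Python's standard difflib module. This is a minimal illustration, not a replacement for either tool; the two configuration fragments are invented stand-ins for the contents of the saved files.

```python
import difflib


def config_changes(old_text: str, new_text: str, old_name: str, new_name: str) -> list:
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


# Made-up configuration fragments standing in for the two saved files.
february = "hostname Router5\nsnmp-server community public RO\n"
march = "hostname Router5\nsnmp-server community public RO\nsnmp-server community mgmt RW\n"

for line in config_changes(february, march, "Router5Feb03.txt", "Router5Mar03.txt"):
    print(line)
```

Lines that were added start with a plus sign and lines that were removed start with a minus sign; an empty result means the two configurations are identical.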
Unfortunately, watching for changes manually is a lot of work. You have to regularly monitor for changes on each and every network device or you could easily miss one. Because the whole point of this exercise is to pick up changes that you didn’t know were being made, you need to have a change detection system that’s a bit more automated.
Proactive Change Notification
>> Software management solutions often use a more sophisticated comparison than a simple Diff: they create a cryptographic checksum of each version of a configuration file. For practical purposes, two checksums match only if no changes were made to the file; if any changes occur, the checksum is different, and the software knows to investigate more closely to determine exactly which changes occurred.
Using a checksum—rather than a line-by-line comparison—allows these software packages to accurately and quickly compare configuration files that might include thousands of lines of text.
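The checksum approach can be sketched as follows. The text does not say which algorithm commercial packages use, so SHA-256 from Python's hashlib is my assumption here, and the configuration text is invented for the example.

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """Return the SHA-256 digest of a configuration as a hex string."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def has_changed(stored_checksum: str, current_config: str) -> bool:
    """True when the current configuration no longer matches the stored checksum."""
    return config_checksum(current_config) != stored_checksum


# Illustrative configuration text, not taken from a real device.
baseline = "hostname Router5\ninterface Serial0\n"
stored = config_checksum(baseline)

print(has_changed(stored, baseline))                      # same text as the baseline
print(has_changed(stored, baseline + "no ip routing\n"))  # one line added
```

Only the short checksum has to be stored and compared on each pass; the more expensive line-by-line comparison is needed only for the files whose checksums differ.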
Ideally, your change management software should allow you to configure daily reports. That way, you’ll be able to carefully review changes on a day-to-day basis rather than waiting a week or more and having to review dozens of potential changes. For example, as Figure 3.2 shows, DeviceAuthority provides a great deal of flexibility in scheduling reports. You can also configure reports to be emailed to multiple recipients. For example, I like to receive a copy of the report myself, and I have another copy sent to my Help desk manager for archival. Whenever we’re conducting a process audit, a third copy is emailed to an auditor, who compares the report to our official change log to verify our compliance with our internal change management process.
Figure 3.2: Creating a daily schedule keeps you on top of unexpected device changes and is a useful tool for auditing your change management process.
Although these change management software solutions involve additional expense and require effort to deploy, they provide a much better means of keeping tabs on your network devices than a manual process.
Automation on the Cheap
To build a low-cost alternative yourself, you need to automate three tasks:
- Commanding devices to dump their configuration files via TFTP. If you have any devices that don’t support TFTP, you’re going to have a hard time automating a means of retrieving their configuration settings. Software solutions can pull configuration data from just about any kind of managed device, so if you have a lot of non-TFTP devices, you have one more argument for purchasing a software package.
- Comparing new and old configuration files.
- Emailing the results.
Each of these tasks can be performed on Windows- or UNIX-based computers, although the exact techniques obviously differ. Because Windows is the most common desktop OS, I’ll focus on techniques for Windows. Where possible, I’ll mention UNIX alternatives.
Automating the Configuration File Dump
http://www.cyber.com.au/cyber/product/cybertel
Use the scriptable Telnet client of your choice to create a batch file. For example, suppose you decide to use the ZOC client, and you create a script named GetRouter5.zrx. This REXX script logs on to a particular router and commands it to write its configuration to a TFTP server. You’d then create a batch file (I’ll use Router5.bat as the filename) that contains the following text:
ZOC /RUN:SCRIPT\GetRouter5.zrx /U
Note that the /U parameter places ZOC into unattended mode, forcing it to take the default settings for any prompts rather than hanging and waiting for a reply.
After the batch file is ready, use Windows’ Task Scheduler to schedule it to run once a day, say at around 1:00 AM. On UNIX systems, you can use cron to set up a similar automation with a scriptable Telnet client for UNIX. Every morning at 1:00 AM, the batch file will then run and command the router to dump its configuration to your TFTP server.
>> If you have multiple devices (and who doesn’t?), simply create a Telnet script for each one. Include multiple lines in your batch file, with each line executing the Telnet client and one Telnet script. The batch file will then run through each device in turn, commanding them to dump their configuration to TFTP.
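Automating the File Comparison
A command-line diff utility for Windows, such as the one included in the MKS Toolkit, accepts two folders as arguments: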
diff -ir -c folder1 folder2
The cool part about this utility is that it can compare all of the files in a folder. So suppose you’ve stored your most recent configuration files in a folder named Old, and you’ve had your devices TFTP their current configurations to a folder named Current. You could execute the following command:
diff -ir -c Old Current > changed.txt
This command will compare each and every file in the two folders and write the results to a file named changed.txt. The results will include each changed line, plus an additional three lines before and after the change to help you locate the change’s context. If you’re using this technique, it’s important that your devices dump their configurations to the same filename each time. Simply create a new batch file (probably on your TFTP server, where the files are located) and schedule it to run by using Task Scheduler. If you set it to run at about 3:00 AM, that should give your first batch file time to complete.
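Where no folder-aware diff utility is installed, the comparison step can be approximated in Python. This is a rough sketch rather than an equivalent of the command above: it compares only files that exist under the same name in both folders, it does not descend into subfolders, and unlike the -i switch it treats upper and lower case as different. The folder and output names follow the example in the text.

```python
import difflib
from pathlib import Path


def compare_folders(old_dir: str, current_dir: str) -> str:
    """Return a context diff for every file whose content differs between the folders."""
    report = []
    for old_file in sorted(Path(old_dir).iterdir()):
        new_file = Path(current_dir) / old_file.name
        if not old_file.is_file() or not new_file.is_file():
            continue  # only files present in both folders are compared
        diff = difflib.context_diff(
            old_file.read_text().splitlines(),
            new_file.read_text().splitlines(),
            fromfile=str(old_file),
            tofile=str(new_file),
            n=3,          # three lines of context around each change
            lineterm="",
        )
        report.extend(diff)
    return "\n".join(report)


if __name__ == "__main__":
    # Runs only when both example folders exist in the working directory.
    if Path("Old").is_dir() and Path("Current").is_dir():
        Path("changed.txt").write_text(compare_folders("Old", "Current"))
```

As with the command-line version, this only works if each device always dumps its configuration to the same filename, because files are matched by name.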
Emailing the File Comparison Results
clemail -quiet -from firstname.lastname@example.org
Of course, you’ll need to type all of that on a single line. Schedule the batch file to run at about 4:00 AM, after the second file finishes running, and you should have an email waiting in your mailbox when you get to work.
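As an alternative to a command-line mailer, the emailing step can be sketched with Python's standard smtplib and email modules. The addresses, the subject line, and the SMTP host are placeholders of my own, not values from the text, and the sketch assumes a mail server that accepts unauthenticated connections from the machine running the script.

```python
import smtplib
from email.message import EmailMessage


def build_report(sender: str, recipient: str, report_text: str) -> EmailMessage:
    """Wrap the comparison results in a plain-text email message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Daily device configuration changes"
    message.set_content(report_text or "No configuration changes detected.")
    return message


def send_report(message: EmailMessage, smtp_host: str) -> None:
    """Hand the message to an SMTP server; the host name is supplied by the caller."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(message)


# Placeholder addresses; nothing is sent until send_report() is called.
report = build_report("netops@example.org", "admin@example.org", "! hostname Router6")
print(report["Subject"])
```

In a scheduled job, the report text would be read from the comparison output file and then passed to send_report() together with the name of your mail server.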
So there you have it, a no-cost (or low-cost, depending on how much you pay for the various utilities you’ll need) solution for automatically detecting changes to network device configurations and emailing those changes to you in a daily report. It’s a lot of work to set up, and you’ll need to fine-tune it to work in your environment. After a while, I suspect you’ll start looking at those change management solutions with a new appreciation for the work that they do!
Q 3.5: Troubleshooting network devices is complicated. Is there a general framework that can make it easier?
A: There’s no industry-standard framework to make network device troubleshooting easier, but there are several resources that can help you develop a framework that works in your environment:
- Cisco provides a detailed Internetwork Troubleshooting Guide at https://www.cisco.com/univercd/cc/td/doc/cisintwk/itg_v1/index.htm. This guide provides troubleshooting steps for just about every aspect of network troubleshooting.
- I often use the links at http://www.teklnk.com/links.htm to find troubleshooting resources. There’s a wealth of tips, tools, and concepts for Cisco, Nortel, and a variety of other vendors.
As I’ve mentioned in previous tips, the best place to start troubleshooting network devices is to look at what has recently changed. You can usually trace most device problems to a recent configuration change that’s not working out as well as you’d hoped; network change management software or even simple text file comparisons of device configurations can help highlight recent changes and let you quickly focus your troubleshooting efforts.
Q 3.6: What is the best way to start troubleshooting router problems?
A: That’s a tall order! Routers are complex, powerful computers in their own right, and can have several problems: routing tables can be wrong, CPU utilization can be high, network interfaces might be down, passwords can be lost, or the router might simply crash.
The best way to start, no matter what the problem, is with a step-by-step troubleshooting flowchart. Most routers’ documentation includes basic troubleshooting flowcharts, which are designed to help narrow the problem as much as possible.
Most manufacturers, including Cisco, Nortel, and 3Com, offer flowcharts for their devices and provide them for download from their Web sites. For example, Cisco 7304 router troubleshooting is available at https://www.cisco.com/pcgi-bin/tsa7304/trouble.pl?tree=7304. You start by selecting from a basic menu of problems (for example, high CPU utilization, interface issues, IOS upgrade, line card issues, password recovery, power, PXF feature support, router crash, and startup). Suppose you were to select interface issues from the main menu; the troubleshooter would walk you through a variety of questions to narrow the problem:
- Are you using an ATM interface?
- What is the output of show interfaces pos?
- What encapsulation method (such as Frame Relay or PPP) are you using?
At the end, the troubleshooter displays a recommended solution. This might include links to other portions of the troubleshooting tree to eliminate or confirm potential causes of the problem.
Cisco also offers these flowcharts in PDF format so that you don’t need Internet access to use them. For the 7304 router, you can download PDF flowcharts by going to https://www.cisco.com/pcgi-bin/tsa7304/flows.pl?tree=7304, then clicking Flow Charts in the left-hand menu.
>> Cisco offers flowcharts for most of its network devices, and you can access all of them from the support section of Cisco’s Web site.
Q 3.7: We have a number of junior administrators, so we need to make network device troubleshooting more of a science and less of an art. What can we do?
A: You can create a sound troubleshooting methodology. To do so, simply answer this question: “How do you find a wolf in Siberia?” Sounds frivolous, but it’s a similar task to network device troubleshooting, which can often seem to an inexperienced administrator like looking for a needle in a haystack. The answer provides the solution: Build a wolf-proof fence down the middle of Siberia, and look for the wolf on one side. If he’s not there, divide what’s left in half again, and repeat. Technically, the technique is referred to as a binary search.
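For readers who think in code, the wolf-fence idea looks like this in Python. The sketch assumes the suspect components can be arranged in order along a path, that exactly one of them is broken, and that a single test can prove everything up to a given point works; the path and the simulated fault are hypothetical.

```python
def first_failing(components, works_up_to):
    """Binary-search an ordered path for the first component that fails.

    works_up_to(i) must return True when everything from components[0]
    through components[i] is proven to work. At least one component is
    assumed to be broken.
    """
    low, high = 0, len(components) - 1
    while low < high:
        middle = (low + high) // 2
        if works_up_to(middle):
            low = middle + 1      # the wolf is on the far side of the fence
        else:
            high = middle         # the wolf is on the near side
    return components[low]


# Hypothetical path and fault, for illustration only.
path = ["laptop", "Office 3 router", "WAN link", "Office 1 router", "desktop"]
broken = "WAN link"
tests_run = []


def works_up_to(index):
    """Simulated test: passes only if the broken component lies beyond index."""
    tests_run.append(path[index])
    return path.index(broken) > index


print(first_failing(path, works_up_to), "found after", len(tests_run), "tests")
```

Each test halves the number of remaining suspects, so even a long path needs only a handful of tests, provided every test gives a conclusive answer.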
An Example Problem
Consider the network diagram that Figure 3.3 shows. Imagine that the client using the laptop computer isn’t able to communicate with the desktop computer in Office 1.
Figure 3.3: Sample troubleshooting problem.
This is a simplistic example, but it will serve to illustrate a troubleshooting methodology, which can be used for any problem, no matter how complex.
Identifying the Problem Domain
The first step is to simply make a list of everything that could be causing the problem. Experienced administrators do this in their head, but it’s worth writing down the list if you’re just getting the hang of troubleshooting. In this case, the list might include:
- Laptop unplugged
- Laptop network stack failure
- Desktop unplugged
- Desktop network stack failure
- Router in Office 3 failed
- Router in Office 1 failed
- WAN link failed
- DNS server not working
- Bad routes in Office 1 router
- Bad routes in Office 3 router
It’s important to make this list because doing so will rule out elements that might seem to be problems—such as the router in Office 2—that obviously aren’t. Of course, the ability to generate a list such as this example list requires a thorough understanding of how the network is built (having documentation such as the network diagram is invaluable) and a thorough knowledge of how the network operates. For example, if you don’t know how computers resolve names to IP addresses, you might not suspect the DNS server.
Breaking the Testable Systems in Half
Next, develop some logical means of dividing the land in half. In this case, about half the potential problems seem to be router-related, and the other half are client-related; breaking the list along those lines creates a basically even set of possibilities.
| Router Problems | Client Problems |
| Bad routes in Office 1 router | Laptop unplugged |
| Bad routes in Office 3 router | Desktop unplugged |
| Office 1 router failed | Stack failure in laptop |
| Office 3 router failed | Stack failure in desktop |
| Bad WAN line | DNS server failed |
Figure 3.4 illustrates how this process effectively divides your suspect subsystems into two logical halves.
Figure 3.4: Dividing the suspect subsystems into half.
Now you need to build your wolf-proof fence down the middle by conducting a test.
The only useful troubleshooting tests are those that allow you to definitively eliminate some potential problem. For example, suppose you determine that the laptop computer also can’t connect to a server in Office 2. What have you proven? Well, nothing, really. You can’t even say for sure that the Office 3 router is OK, although it’s now less likely that it has failed or has a bad route. In other words, you haven’t built a wolf-proof fence at all.
Suppose, however, that you are able to connect to computers on the Office 3 network from the laptop, and connect to computers on the Office 1 network from the desktop. That’s a definitive test: you can eliminate half of your suspect systems from the list because you’ve proven that they work.
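One way to keep track of what a conclusive test has actually eliminated is to record, for each test, the suspects it proves healthy when it passes. The following Python sketch uses the suspect list from this example; the mapping of tests to suspects is my own illustrative assumption, and the name lookup is included so that the DNS server can be cleared along with the clients.

```python
SUSPECTS = [
    "Laptop unplugged",
    "Laptop network stack failure",
    "Desktop unplugged",
    "Desktop network stack failure",
    "Router in Office 3 failed",
    "Router in Office 1 failed",
    "WAN link failed",
    "DNS server not working",
    "Bad routes in Office 1 router",
    "Bad routes in Office 3 router",
]

# Each conclusive test names the suspects it proves healthy when it passes.
# This mapping is an assumption made for the example network.
CLEARED_BY = {
    "laptop reaches hosts in Office 3": {"Laptop unplugged", "Laptop network stack failure"},
    "desktop reaches hosts in Office 1": {"Desktop unplugged", "Desktop network stack failure"},
    "name lookup returns the desktop's address": {"DNS server not working"},
}


def remaining_suspects(suspects, passed_tests):
    """Return the suspects not yet proven healthy by any passed test."""
    cleared = set()
    for test_name in passed_tests:
        cleared |= CLEARED_BY[test_name]
    return [suspect for suspect in suspects if suspect not in cleared]


for suspect in remaining_suspects(SUSPECTS, CLEARED_BY.keys()):
    print(suspect)
```

A test that fails, or one that passes without proving anything, simply clears nothing, which is the code equivalent of a fence that isn't wolf-proof.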
>> Stuck for tests? Go one-by-one. If you can’t readily think of a test that will result in your wolf-proof fence, you can just eliminate subsystems from the list one at a time. For example, you can check the connections on both computers and confirm that they can ping their gateways, which shows that their stacks are functioning. You can use nslookup to test the DNS server(s) to eliminate them from the list. However, efficient troubleshooting requires you to be able to divide the list in such a way that one or two tests can eliminate half the list. That type of efficiency comes primarily with strong knowledge of how the network works and with good old experience.
Divide, Conquer, Repeat
With half the list out of the way, you can start working on the other half. Figure 3.5 illustrates the systems you’ve eliminated, including DNS servers at each office (shown in the diagram as Server1B and Server3B), the client computers, and their network connections.
Figure 3.5: Half the suspect systems eliminated, with just the green-colored half to go.
Additional tests at this point could involve logging on to one of the two routers and attempting to ping the other one. That test, if it worked, would eliminate the WAN links as a potential suspect and let you know that at least the routers’ external interfaces are up and running. You’d be down to a quarter of your original list, and the odds would start looking good for a bad route in one of the routers. Manually checking the routing tables would let you know whether that was the problem.
In some cases, you might be able to go after the entire list of suspect systems with one good test. For example, running tracert from the laptop to the desktop will help you eliminate most, if not all, of the suspect systems. If DNS has failed, tracert will tell you so. If it’s a local connectivity issue, you’ll see that in the results. If a router has a bad route, you’ll see that in the results, too. A WAN failure won’t be distinguishable from a failed router interface, but you’ll at least have narrowed the list to two possible candidates.
>> Know your tools! Another trick to performing this methodology is having thorough knowledge of the troubleshooting tools at your disposal. Knowing what ping, pathping, and tracert can do, for example, will enable you to select the most effective test for eliminating a particular subsystem.
Selecting the right testing tools can make all the difference, particularly with regard to efficiency. For example, if you were following the troubleshooting path I’ve been using, you might have spent an hour or so figuring out that a bad route was at fault. Tracert, however, could have brought you to this conclusion in 5 minutes or so. You would have found the problem either way, eventually, which shows that the methodology is useful even to an administrator without years of experience.
Now It’s a Science
Where do most new administrators get caught up? First, they might not completely understand how the network functions, so they ignore suspect subsystems and spend their time troubleshooting only part of the problem. Second, they often don’t perform conclusive tests—they might incorrectly eliminate a suspect subsystem, and waste time looking for wolves in the wrong part of Siberia.
It’s a simple methodology, one that experienced administrators follow almost without thinking about it—which makes it difficult to teach to newer personnel. To summarize:
- Identify the problem domain
- List suspect subsystems
- Break the list into halves so that one half can be eliminated by one or two conclusive tests
- Perform conclusive tests to focus on one half or the other; repeat the process by splitting what’s left in half
- Ensure that all tests can conclusively eliminate something; essentially, all tests must prove that something is either working or not with no room for question
This tried-and-true methodology becomes instinctive through experience, but for less experienced technical professionals, it can make the daunting task of network troubleshooting more approachable, methodical, and efficient.