The book from which this chapter is excerpted presents tips and tricks for six network configuration management topics. For ease of use, the questions and their solutions are divided into sections by topic, and each question is numbered according to its topic: Topic 1: Change Management Best Practices, Topic 2: Network Management Security, Topic 3: Network Management Troubleshooting, Topic 4: Change Management Techniques, Topic 5: Selecting and Deploying a Network Device Management Solution, and Topic 6: Enterprise Network Device Management.
Topic 3: Network Management Troubleshooting
Q 3.1: What is the first step toward fixing a router that isn't working?
A: The first question you should ask is "What changed?" Very few network devices go belly up on their own; you'll find that it usually requires human involvement to really screw things up. Assuming that you've eliminated some kind of hardware failure as the cause of the problem, the culprit is most likely a recent change made to the device's configuration. Of course, if the hardware is at fault, you simply need to replace the hardware and restore your configuration from a backup.
Restoring from a backup (you do have a backup of the router's configuration, don't you?) is a good first step even if the hardware is fine. Ideally, the backup configuration will resolve the problem, and you can use a tool to compare the old and new configurations to determine the differences. That's not exactly troubleshooting the problem, but unless you're working in a lab, your goal should be to restore the device to operation first and figure out what caused the problem later.
>> One change at a time, please! The idea of using a known-good backup to recover from a device failure only works if you tend to make a small number of changes at a time, let them settle to ensure that they're working properly, then immediately make a backup. If you're in the habit of making a raft of changes at once, you'll have a much more difficult time tracking down the change that caused the problem.
If you don't actually have a recent backup, shame on you! Hopefully you have change management documentation that describes the changes that have been made to the router in recent memory. Start examining those changes to see which ones might apply to the problem you're having. If necessary, manually undo each change, one at a time, until the problem goes away.
Other changes might involve a device operating system (OS) upgrade or patch. In such cases, you should never make a change without understanding how you can roll back to the prior (working) version of the OS. If necessary, keep a spare router on hand in case the OS upgrade or patch kills your production unit. The goal, in any event, is to not worry so much about troubleshooting the current problem, and to simply fall back to the last configuration that worked.
Keep in mind that not all changes need to involve the router's configuration files or OS. For example, perhaps your company recently hired someone to straighten out that rat's nest of a wiring closet, and that person accidentally plugged the router into the wrong subnet when he or she put the closet back together. The wiring closet change should have been documented as a network change, and would tip you off that you need to check out the router's interfaces to see what they're plugged into.
>> There's no such thing as a minor change! Every single change to your network devices should go through your change management process. No change is too minor. We've all heard the story about the technician who blew dust out of a router's cooling fan. He blew hard enough to stop the fan, causing the router to overheat and restart itself at seemingly random intervals. Had that simple maintenance action (cleaning out the router) been logged as a change, a senior administrator might have guessed that the problem was in the cooling fan, and checked that out first for a speedier resolution to the problem.
Of course, if you don't have a change management program in place or, at least, a backup of the router's configuration, you're out of easy options. You'll need to start troubleshooting the problem the hard way, which might eventually involve completely reloading the router's factory configuration and rebuilding your configuration from scratch. Such drastic measures highlight the importance of both backups and a solid change management methodology.
Q 3.2: How can change management contribute to improved network performance?
A: Managing large networks is a complex, difficult task. Suppose you took a job at a large corporation with tens of thousands of users spread across dozens of offices. Your job, you're told, is to find out why network performance is slow. Where do you start?
You could whip out your network analysis tools and start analyzing bandwidth utilization, broadcast traffic, router load, switch bandwidth, firewall utilization, and so forth, but doing so would require tons of time and might never point to a real performance bottleneck. If you do find a bottleneck, all you could really do is start shooting in the dark, making device configuration changes in an attempt to fix the bottleneck. More often than not, that practice simply reveals additional bottlenecks, creating an unending process of network configuration changes that never really improve performance. If you're after actual results, your best starting place is gathering some basic performance trend information and analyzing the network's change-management log.
If you can pin down a rough point in time when performance started to become less than optimal, you can start analyzing the changes that were made to the network's infrastructure devices around that time. You might discover, for example, a switch to a less-efficient routing protocol, or you might find that the routers connecting the various offices are providing packet filtering services. You might discover incorrectly configured multicast boundaries that are resulting in excess WAN traffic. Regardless, the configuration history can point to potential problems that contribute to the network's current condition. Discovering those problems empirically could take weeks or more, but finding them in the configuration history can be much, much easier.
The fact is that modern networks are becoming too large and too complex to manage as a single unit. Instead, you have to manage them in bits and pieces, and you have to manage them in small chunks of time. For example, suppose your company is getting ready to make a whole series of network device reconfigurations designed to improve performance or simply designed to increase network addressing capacity. Before making the changes, you can take a complete set of performance measurements. By taking another set of measurements after the changes are complete, you can determine the performance impact of the change, and relate those changes to specific configuration changes from the configuration history. You're not attempting to manage the network's overall performance. Instead, you're simply trying to manage the performance delta, or difference between the two configurations. Some administrators refer to this process as managing in increments, and it's an effective way to keep on top of large, complex networks.
Of course, managing in increments is only possible if you have a solid change-management process in place. The change-management process provides some important capabilities:
- Change management provides a logical checkpoint, allowing you the opportunity to take performance measurements before and after a discrete set of changes.
- Change management provides a history, enabling you to compare before and after configurations and relate them to measured performance changes.
- Change management provides a rollback mechanism, making it easier to revert to a previous configuration if the performance of a new configuration isn’t what you desired.
Ideally, you'll have access to software that can help gather and maintain device configuration information for historical and analytical purposes. That software might even allow you to store performance measurements so that you can save a performance baseline with each set of changes, defining a point in time at which that performance was measured and relating it to the device configuration that resulted in the performance.
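To make the idea of a performance delta concrete, here is a minimal sketch in Python. It is not taken from any particular product; the `Baseline` class, the revision labels, and the metric names are all hypothetical and chosen only for illustration. It simply subtracts the "before" measurements from the "after" measurements so that the result can be stored next to the configuration revision that produced it.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """A set of measurements tied to one configuration revision (hypothetical structure)."""
    label: str      # which configuration revision the numbers belong to
    metrics: dict   # metric name -> measured value


def performance_delta(before: Baseline, after: Baseline) -> dict:
    """Return after-minus-before for every metric present in both baselines."""
    shared = before.metrics.keys() & after.metrics.keys()
    return {name: after.metrics[name] - before.metrics[name] for name in sorted(shared)}


# Made-up measurements taken before and after a set of changes.
before = Baseline("config-rev-41", {"wan_latency_ms": 80.0, "wan_utilization_pct": 62.0})
after = Baseline("config-rev-42", {"wan_latency_ms": 95.0, "wan_utilization_pct": 71.0})

delta = performance_delta(before, after)
for name, change in delta.items():
    print(f"{before.label} -> {after.label}  {name}: {change:+.1f}")
```

A positive latency or utilization delta after a change set is your cue to review that set in the configuration history before looking anywhere else.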
Q 3.3: What are some industry best practices for troubleshooting network devices?
A: Network devices have been around a long time, and the technology industry has developed several best practices that make troubleshooting easier and often let you avoid the need to troubleshoot altogether. As author Scott M. Ballew states in his book Managing IP Networks with Cisco Routers (O’Reilly and Associates), “The best way to handle network problems is to avoid them.”
Here are some additional tips I've picked up over the years:
- Create detailed documentation of your network’s physical connections. One of the most common reasons for network downtime is swapped cables, and a detailed map of which wires go where can be a huge benefit during troubleshooting. Given the alternative (tugging on wires until you figure out where they go), making documentation is a great investment.
- As I’ve described in other tips, document every change you make to network devices’ configurations, and have backup configurations ready in case a change backfires.
- Your first troubleshooting step should often be to simply undo whatever it was you did last. Backup configuration files can make doing so very easy and will let you review the problem-causing changes at your leisure.
- Make as few changes as possible at a time; that way, if problems occur, you’ll have fewer changes to sort through to find the cause. How long you wait between changes is a matter of personal taste; I like to wait at least 1 week so that my network can experience the full range of a week’s workload before I certify the change as a success. Of course, in a busy network environment that uses the latest technologies, limiting your workload can be difficult or impossible, making third-party change-management tools all the more valuable.
Experienced administrators have learned these tips through trial and error. You likely have a few other common practices you follow in your environment to keep things running smoothly.
Q 3.4: How can I determine whether a new product or a consultant makes changes to our network devices?
A: Large companies are likely to have any number of consultants and contractors running around on different projects at any given time. Some of them might have the authority to make changes to your network devices, probably with the understanding that they document any changes they make. However, there’s always a change or two that gets made right before the weekend that doesn’t make it into the documentation.
In addition, it’s possible for new software applications to make changes to your network devices. Suppose you’re evaluating a new network performance monitoring solution that needs to query information from your routers. Or perhaps you’re installing an enterprise management solution that needs credentials to access your managed network devices. In these cases, the software might make minor configuration changes to your devices without your knowledge. That’s not necessarily a bad thing; the changes made by these software packages are usually minor and simply make it easier for the software to do its job. But you still need to know about those changes in order to control your device change management process. So what can you do?
Unfortunately, very few network devices are designed to automatically notify an administrator when their configurations are changed. After all, only an administrator should have the credentials to make a change, so the devices quite reasonably assume that the administrator made any changes and doesn’t need to be notified.
Manually Detecting Changes
Most higher-end network devices allow you to use Trivial File Transfer Protocol (TFTP) to transfer the devices’ configuration files to a TFTP server (I explained how to set up a TFTP server in tip 4.2). If you regularly dump your devices’ configurations to TFTP and save the files, you have a baseline from which to check for changes to the devices’ configuration. For example, suppose you downloaded a router’s configuration into a file you named Router5Feb03.txt. A contractor recently finished installing a new enterprise management solution, and you want to see if any changes were made to Router5. Just follow these steps:
1. Use Telnet to connect to the router that you want to back up (for this example, I’ll assume you’re using a Cisco router; change the commands as necessary if you’re using a different device). Obviously, you could also use the router’s IP address instead of a name.
2. Log on to the router.
3. Enter privileged mode and provide the correct password. Doing so lets you access the router’s configuration.
4. Issue the command that copies the configuration to a TFTP server, then enter the IP address of your TFTP server.
5. Enter the name of the configuration file (I’ll use Router5Mar03.txt for this example).
6. Press Enter to confirm the write. Ensure that the router responds with an [OK] prompt after writing the configuration.
7. Log out of the router.
Now you’ve got two text files, one with the old configuration and one with the new configuration. You simply need to compare the two. Assuming you’re running on a UNIX computer, enter the following command:
diff -abls Router5Feb03.txt Router5Mar03.txt
If you’re using Windows, you can use a graphical version of Diff, called CSDiff, which I mentioned first in tip 4.2. It’s available from Component Software and makes it much easier to spot changes between versions of a text file. Best of all, it’s a free tool. Figure 3.1 shows how CSDiff highlights the differences between two text files.
Figure 3.1: Using CSDiff to analyze the differences in a router configuration file
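If you have neither Diff nor CSDiff handy, the same comparison can be sketched with Python's standard `difflib` module. This is only an illustration: the function name `config_changes` is my own, and the two sample configurations are invented stand-ins for the contents of Router5Feb03.txt and Router5Mar03.txt.

```python
import difflib


def config_changes(old_text: str, new_text: str, old_name: str, new_name: str) -> list:
    """Return unified-diff lines describing how new_text differs from old_text."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


# Invented sample configurations; in practice you would read the two saved files.
old_config = "hostname Router5\nsnmp-server community public RO\n"
new_config = (
    "hostname Router5\n"
    "snmp-server community public RO\n"
    "snmp-server host 10.0.0.5 public\n"
)

changes = config_changes(old_config, new_config, "Router5Feb03.txt", "Router5Mar03.txt")
print("\n".join(changes))
```

Lines that start with `+` were added since the older backup, and lines that start with `-` were removed. An empty result means the two configurations are identical.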
Unfortunately, watching for changes manually is a lot of work. You have to regularly monitor for changes on each and every network device or you could easily miss one. Because the whole point of this exercise is to pick up changes that you didn’t know were being made, you need to have a change detection system that’s a bit more automated.
Proactive Change Notification
Enter device change management software. Most of the big players in this field, including AlterPoint DeviceAuthority, Tripwire, and Cisco’s CiscoWorks, can immediately notify you via email when a network device’s configuration changes. These solutions run on a server and periodically (usually daily, although you can configure more frequent intervals) download your devices’ configuration files. They then perform an internal comparison, not unlike the manual Diff I used earlier, to compare the most recent configuration with the last one they downloaded. If they spot any changes, they generate an email to an administrator.
>> Software management solutions often use a more sophisticated comparison than a simple Diff: they create a cryptographic checksum of each version of a configuration file. For all practical purposes, the checksum is the same only if no changes were made to the file; if any changes occur, the checksum is different, and the software knows to investigate more closely to determine exactly which changes occurred.
Using a checksum—rather than a line-by-line comparison—allows these software packages to accurately and quickly compare configuration files that might include thousands of lines of text.
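The checksum approach is easy to sketch with Python's standard `hashlib` module. The source doesn't say which algorithm the commercial products use, so SHA-256 here is my assumption, and the function names and sample configuration are hypothetical.

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """Return a SHA-256 hex digest of a configuration (algorithm choice is an assumption)."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def has_changed(previous_text: str, current_text: str) -> bool:
    """Compare checksums first; only a mismatch warrants a line-by-line diff."""
    return config_checksum(previous_text) != config_checksum(current_text)


# Invented sample configuration and a one-line modification of it.
yesterday = "hostname Router5\nsnmp-server community public RO\n"
today = yesterday + "snmp-server host 10.0.0.5 public\n"

print(has_changed(yesterday, yesterday))
print(has_changed(yesterday, today))
```

In practice you would store only the digest alongside each archived configuration, so that the nightly check compares two short strings instead of two long files.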
Ideally, your change management software should allow you to configure daily reports. That way, you’ll be able to carefully review changes on a day-to-day basis rather than waiting a week or more and having to review dozens of potential changes. For example, as Figure 3.2 shows, DeviceAuthority provides a great deal of flexibility in scheduling reports. You can also configure reports to be emailed to multiple recipients. For example, I like to receive a copy of the report myself, and I have another copy sent to my Help desk manager for archival. Whenever we’re conducting a process audit, a third copy is emailed to an auditor, who compares the report to our official change log to verify our compliance with our internal change management process.
Figure 3.2: Creating a daily schedule keeps you on top of unexpected device changes and is a useful tool for auditing your change management process.
Although these change management software solutions involve additional expense and require effort to deploy, they provide a much better means of keeping tabs on your network devices than a manual process.
Automation on the Cheap
If you’re completely unable to implement a change management software solution, you’re not completely out of luck. You can still automate parts of the manual detection process and provide some basic functionality for keeping track of unexpected changes to network devices. Basically, you need to break down the process into its component steps, and come up with a means of automating each step:
- Commanding devices to dump their configuration files via TFTP. If you have any devices that don’t support TFTP, you’re going to have a hard time automating a means of retrieving their configuration settings. Software solutions can pull configuration data from just about any kind of managed device, so if you have a lot of non-TFTP devices, you have one more argument for purchasing a software package.
- Comparing new and old configuration files.
- Emailing the results.
Each of these tasks can be performed on Windows- or UNIX-based computers, although the exact techniques obviously differ. Because Windows is the most common desktop OS, I’ll focus on techniques for Windows. Where possible, I’ll mention UNIX alternatives.
Automating the Configuration File Dump
You need to be able to script a Telnet session to automatically log onto your devices and command a TFTP dump. Unfortunately, Windows’ built-in Telnet client doesn’t support scripting. However, you can get a scriptable Telnet client, called Cybersource Scriptable Telnet, from http://www.cyber.com.au/cyber/product/cybertel. Another scriptable client, which I prefer, is the ZOC Terminal Emulator and Telnet/SSH Client available from http://www.emtec.com. ZOC understands a superset of the REXX scripting language, which makes it a pretty powerful automation tool.
Use the scriptable Telnet client of your choice to create a batch file. For example, suppose you decide to use the ZOC client, and you create a script named GetRouter5.zrx. This REXX script logs onto a particular router and commands it to write its configuration to a TFTP server. You’d then create a batch file (I’ll use Router5.bat as the filename) that contains the following text:
ZOC /RUN:SCRIPT\GetRouter5.zrx /U
Note that the /U parameter places ZOC into unattended mode, forcing it to take the default settings for any prompts rather than hanging and waiting for a reply.
After the batch file is ready, use Windows’ Task Scheduler to schedule the batch file to run once a day, say at around 1:00 AM. On UNIX systems, you can use cron to set up a similar automation, using a scriptable Telnet client for UNIX. Every morning at 1:00 AM, the scheduled job will then run and command the router to dump its configuration to your TFTP server.
>> If you have multiple devices (and who doesn’t?), simply create a Telnet script for each one. Include multiple lines in your batch file, with each line executing the Telnet client and one Telnet script. The batch file will then run through each device in turn, commanding them to dump their configuration to TFTP.
Automating the File Comparison
You don’t want a fancy GUI to automate file comparison, so CSDiff isn’t really appropriate. Instead, you want a basic command-line Diff (like the UNIX guys have) that will output differences to a file. You can get one from MKS. The syntax to use is:
diff -ir -c folder1 folder2
The cool part about this utility is that it can compare all of the files in a folder. So suppose you’ve stored your most recent configuration files in a folder named Old, and you’ve had your devices TFTP their current configurations to a folder named Current. You could execute the following command:
diff -ir -c Old Current > changed.txt
This command will compare each and every file in the two folders and write the results to a file named Changed.txt. The results will include each changed line, plus an additional three lines before and after the change to help you locate the change’s context. If you’re using this technique, it’s important that your devices dump their configurations to the same filename each time. Simply create a new batch file containing this command (probably on your TFTP server, where the files are located) and schedule it to run by using Task Scheduler. If you set it to run at about 3:00 AM, that should give your first batch file time to complete.
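If you'd rather not track down a command-line Diff, the folder comparison can be approximated with Python's standard library. The sketch below is an assumption-laden stand-in for the MKS utility, not a replacement for it: `compare_folders` is a name I made up, the sample route lines are invented, and, unlike a recursive diff, it silently skips files that exist in only one of the two folders.

```python
import difflib
import tempfile
from pathlib import Path


def compare_folders(old_dir: Path, current_dir: Path) -> str:
    """Context-diff every file that exists under the same name in both folders."""
    report = []
    for old_file in sorted(old_dir.iterdir()):
        current_file = current_dir / old_file.name
        if not (old_file.is_file() and current_file.is_file()):
            continue  # simplification: files present in only one folder are skipped
        report.extend(
            difflib.context_diff(
                old_file.read_text().splitlines(),
                current_file.read_text().splitlines(),
                fromfile=str(old_file),
                tofile=str(current_file),
                n=3,  # three lines of context around each change
                lineterm="",
            )
        )
    return "\n".join(report)


# Demonstration with throwaway folders and made-up configuration lines.
with tempfile.TemporaryDirectory() as tmp:
    old_dir, current_dir = Path(tmp, "Old"), Path(tmp, "Current")
    old_dir.mkdir()
    current_dir.mkdir()
    (old_dir / "Router5.txt").write_text(
        "hostname Router5\nip route 10.1.0.0 255.255.0.0 10.0.0.1\n"
    )
    (current_dir / "Router5.txt").write_text(
        "hostname Router5\nip route 10.1.0.0 255.255.0.0 10.0.0.2\n"
    )
    report = compare_folders(old_dir, current_dir)
    Path(tmp, "changed.txt").write_text(report)  # the same role Changed.txt plays above

print(report)
```

Changed lines are marked with `!` in the output. As with the batch-file approach, this only works if each device always dumps its configuration to the same filename.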
Emailing the File Comparison Results
You’re ready to email Changed.txt, the file that contains any changes found in your device configuration files. You’ll need a command-line email utility, such as BySoft’s Command Line E-mailer at http://www.bysoft.se. Create a third batch file with this command:
clemail -quiet -from email@example.com
Of course, you’ll need to type all of that on a single line. Schedule the batch file to run at about 4:00 AM, after the second file finishes running, and you should have an email waiting in your mailbox when you get to work.
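For comparison, here is a rough sketch of the same mailing step using Python's standard `email` and `smtplib` modules instead of a third-party utility. The recipient address, subject line, SMTP host, and function names are all placeholders of my own; only the sender address comes from the command above. The example builds the message but doesn't send it, because sending requires a reachable mail server.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender: str, recipient: str, report_text: str) -> EmailMessage:
    """Wrap the comparison report in a plain-text email message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Device configuration changes"  # placeholder subject
    message.set_content(report_text or "No configuration changes detected.")
    return message


def send_report(message: EmailMessage, smtp_host: str) -> None:
    """Hand the message to an SMTP server (the host name is supplied by the caller)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(message)


# Build (but do not send) a message from an invented one-line report.
message = build_report_email(
    "email@example.com",
    "admin@example.com",  # hypothetical recipient
    "! ip route 10.1.0.0 255.255.0.0 10.0.0.2",
)
print(message["Subject"])
```

To actually deliver the report, you would call `send_report(message, "your-mail-server")` from the scheduled job, substituting the name of a mail server that accepts mail from the machine running the script.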
So there you have it, a no-cost (or low-cost, depending on how much you pay for the various utilities you’ll need) solution for automatically detecting changes to network device configurations and emailing those changes to you in a daily report. It’s a lot of work to set up, and you’ll need to fine-tune it to work in your environment. After a while, I suspect you’ll start looking at those change management solutions with a new appreciation for the work that they do!
Q 3.5: Troubleshooting network devices is complicated. Is there a general framework that can make it easier?
A: There’s no industry-standard framework to make network device troubleshooting easier, but there are several resources that can help you develop a framework that works in your environment:
- Cisco provides a detailed Internetwork Troubleshooting Guide at http://www.cisco.com/univercd/cc/td/doc/cisintwk/itg_v1/index.htm. This guide provides troubleshooting steps for just about every aspect of network troubleshooting.
- I often use the links at http://www.teklnk.com/links.htm to find troubleshooting resources. There’s a wealth of tips, tools, and concepts for Cisco, Nortel, and a variety of other vendors.
As I’ve mentioned in previous tips, the best place to start troubleshooting network devices is to look at what has recently changed. You can usually trace most device problems to a recent configuration change that’s not working out as well as you’d hoped; network change management software or even simple text file comparisons of device configurations can help highlight recent changes and let you quickly focus your troubleshooting efforts.
Q 3.6: What is the best way to start troubleshooting router problems?
A: That’s a tall order! Routers are complex, powerful computers in their own right, and can have several problems: routing tables can be wrong, CPU utilization can be high, network interfaces might be down, passwords can be lost, or the router might simply crash.
The best way to start, no matter what the problem, is with a step-by-step troubleshooting flowchart. Most routers’ documentation includes basic troubleshooting flowcharts, which are designed to help narrow the problem as much as possible.
Most manufacturers, including Cisco, Nortel, and 3Com, offer flowcharts for their devices and provide them for download from their Web sites. For example, Cisco 7304 router troubleshooting is available at http://www.cisco.com/pcgi-bin/tsa7304/trouble.pl?tree=7304. You start by selecting from a basic menu of problems (for example, high CPU utilization, interface issues, IOS upgrade, line card issues, password recovery, power, PXF feature support, router crash, and startup). Suppose you were to select interface issues from the main menu; the troubleshooter would walk you through a variety of questions to narrow the problem:
- Are you using an ATM interface?
- What is the output of show interfaces pos?
- What encapsulation method (such as Frame Relay or PPP) are you using?
At the end, the troubleshooter displays a recommended solution. This might include links to other portions of the troubleshooting tree to eliminate or confirm potential causes of the problem.
Cisco also offers these flowcharts in PDF format so that you don’t need Internet access to use them. For the 7304 router, you can download PDF flowcharts by going to http://www.cisco.com/pcgi-bin/tsa7304/flows.pl?tree=7304, then clicking Flow Charts in the left-hand menu.
>> Cisco offers flowcharts for most of its network devices, and you can access all of them from the support section of Cisco’s Web site.
Q 3.7: We have a number of junior administrators, so we need to make network device troubleshooting more of a science and less of an art. What can we do?
A: You can create a sound troubleshooting methodology. To do so, simply answer this question: “How do you find a wolf in Siberia?” Sounds frivolous, but it’s a similar task to network device troubleshooting, which can often seem to an inexperienced administrator like looking for a needle in a haystack. The answer provides the solution: Build a wolf-proof fence down the middle of Siberia, and look for the wolf on one side. If he’s not there, divide what’s left in half again, and repeat. Technically, the technique is referred to as a binary search.
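The wolf-fence idea maps directly onto a binary search. The sketch below is a simplified illustration, not a real diagnostic tool: it assumes the devices between two computers can be written as an ordered path, and that a single fault makes that device and everything beyond it unreachable. The hop names and the `reachable` probe are hypothetical stand-ins for whatever test you would actually run, such as a ping.

```python
from typing import Callable, Optional, Sequence


def first_failing_hop(path: Sequence[str], reachable: Callable[[str], bool]) -> Optional[str]:
    """Binary-search an ordered path for the first hop that cannot be reached.

    Assumes every hop before the fault is reachable and every hop from the
    fault onward is not.
    """
    low, high = 0, len(path)
    while low < high:
        mid = (low + high) // 2
        if reachable(path[mid]):
            low = mid + 1   # no wolf on the near side of the fence
        else:
            high = mid      # the wolf is at this hop or nearer
    return path[low] if low < len(path) else None


# Hypothetical path from one computer to another, with a simulated fault.
path = ["laptop NIC", "Office 3 router", "WAN link", "Office 1 router", "desktop NIC"]
broken = "WAN link"
probes = []


def reachable(hop: str) -> bool:
    probes.append(hop)  # record each test so we can count them
    return path.index(hop) < path.index(broken)


culprit = first_failing_hop(path, reachable)
print(culprit, "found after", len(probes), "tests")
```

The point is the test count: each test halves what's left, so even a long path needs only a handful of tests, provided each test is conclusive.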
An Example Problem
Consider the network diagram that Figure 3.3 shows. Imagine that the client using the laptop computer isn’t able to communicate with the desktop computer in Office 1.
Figure 3.3: Sample troubleshooting problem.
This is a simplistic example, but it will serve to illustrate a troubleshooting methodology, which can be used for any problem, no matter how complex.
Identifying the Problem Domain
The first step is to simply make a list of everything that could be causing the problem. Experienced administrators do this in their heads, but it’s worth writing down the list if you’re just getting the hang of troubleshooting. In this case, the list might include:
- Laptop unplugged
- Laptop network stack failure
- Desktop unplugged
- Desktop network stack failure
- Router in Office 3 failed
- Router in Office 1 failed
- WAN link failed
- DNS server not working
- Bad routes in Office 1 router
- Bad routes in Office 3 router
It’s important to make this list because doing so will rule out elements that might seem to be problems—such as the router in Office 2—that obviously aren’t. Of course, the ability to generate a list such as this example list requires a thorough understanding of how the network is built (having documentation such as the network diagram is invaluable) and a thorough knowledge of how the network operates. For example, if you don’t know how computers resolve names to IP addresses, you might not suspect the DNS server.
Breaking the Testable Systems in Half
Next, develop some logical means of dividing the land in half. In this case, about half the potential problems seem to be router-related, and the other half are client-related; breaking the list along those lines creates a basically even set of possibilities.
|Router Problems|Client Problems|
|Bad routes in Office 1 router|Laptop unplugged|
|Bad routes in Office 3 router|Desktop unplugged|
|Office 1 router failed|Stack failure in laptop|
|Office 3 router failed|Stack failure in desktop|
|Bad WAN line|DNS server failed|
Figure 3.4 illustrates how this process effectively divides your suspect subsystems into a logical half.
Figure 3.4: Dividing the suspect subsystems into half.
Now you need to build your wolf-proof fence down the middle by conducting a test.
The only useful troubleshooting tests are those that allow you to definitively eliminate some potential problem. For example, suppose you determine that the laptop computer also can’t connect to a server in Office 2. What have you proven? Well, nothing, really. You can’t even say for sure that the Office 3 router is OK, although it’s now less likely that it has failed or has a bad route. In other words, you haven’t built a wolf-proof fence at all.
Suppose, however, that you are able to connect to computers on the Office 3 network from the laptop, and connect to computers on the Office 1 network from the desktop. That’s a definitive test: you can eliminate half of your suspect systems from the list because you’ve proven that they work.
>> Stuck for tests? Go one-by-one. If you can’t readily think of a test that will result in your wolf-proof fence, you can just eliminate subsystems from half of the list one at a time. For example, you can check the connections on both computers and ensure that they can ping their gateways to ensure that their stacks are functioning. You can use nslookup to test the DNS server(s) to eliminate them from the list. However, efficient troubleshooting requires you to be able to divide the list in such a way that one or two tests can eliminate half the list. That type of efficiency comes primarily with strong knowledge of how the network works and with good old experience.
Divide, Conquer, Repeat
With half the list out of the way, you can start working on the other half. Figure 3.5 illustrates the systems you’ve eliminated, including DNS servers at each office (shown in the diagram as Server1B and Server3B), the client computers, and their network connections.
Figure 3.5: Half the suspect systems eliminated, with just the green-colored half to go.
Additional tests at this point could involve logging on to one of the two routers and attempting to ping the other one. That test, if it worked, would eliminate the WAN links as a potential suspect and let you know that at least the routers’ external interfaces are up and running. You’d be down to a quarter of your original list, and the odds would start looking good for a bad route in one of the routers. Manually checking the routing tables would let you know whether that was the problem.
In some cases, you might be able to go after the entire list of suspect systems with one good test. For example, running tracert from the laptop to the desktop will help you eliminate most, if not all, of the suspect systems. If DNS has failed, tracert will tell you so. If it’s a local connectivity issue, you’ll see that in the results. If a router has a bad route, you’ll see that in the results, too. A WAN failure won’t be distinguishable from a failed router interface, but you’ll at least have narrowed the list to two possible candidates.
>> Know your tools! Another trick to performing this methodology is having thorough knowledge of the troubleshooting tools at your disposal. Knowing what ping, pathping, and tracert can do, for example, will enable you to select the most effective test for eliminating a particular subsystem.
Selecting the right testing tools can make all the difference, particularly with regard to efficiency. For example, if you were following the troubleshooting path I’ve been using, you might have spent an hour or so figuring out that a bad route was at fault. Tracert, however, could have brought you to this conclusion in 5 minutes or so. However, you would have found the problem either way, eventually, proving that the methodology is useful even to an administrator without years of experience.
Now It’s a Science
Where do most new administrators get caught up? First, they might not completely understand how the network functions, so they ignore suspect subsystems and spend their time troubleshooting only part of the problem. Second, they often don’t perform conclusive tests—they might incorrectly eliminate a suspect subsystem, and waste time looking for wolves in the wrong part of Siberia.
It’s a simple methodology, one that experienced administrators follow almost without thinking about it—which makes it difficult to teach to newer personnel. To summarize:
- Identify the actual cause of the problem
- List suspect subsystems
- Break the list into halves so that one half can be eliminated by one or two conclusive tests
- Perform conclusive tests to focus on one half or the other; repeat the process by splitting what’s left into half
- Ensure that all tests can conclusively eliminate something; essentially, all tests must prove that something is either working or not with no room for question
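For junior administrators, it can help to see the bookkeeping behind these steps written out. The sketch below uses the suspect list from the example problem. The mapping from each test to the suspects it clears is my own reading of the example, so treat it as an assumption rather than a rule; the function and test names are hypothetical.

```python
def remaining_suspects(suspects: set, passed_tests: list, clears: dict) -> set:
    """Drop every suspect that a passed, conclusive test proves is working."""
    remaining = set(suspects)
    for test in passed_tests:
        remaining -= clears[test]
    return remaining


suspects = {
    "Laptop unplugged", "Laptop network stack failure",
    "Desktop unplugged", "Desktop network stack failure",
    "Router in Office 3 failed", "Router in Office 1 failed",
    "WAN link failed", "DNS server not working",
    "Bad routes in Office 1 router", "Bad routes in Office 3 router",
}

# Each conclusive test is listed with the suspects it eliminates when it passes (assumed mapping).
clears = {
    "local hosts reachable from laptop and desktop": {
        "Laptop unplugged", "Laptop network stack failure",
        "Desktop unplugged", "Desktop network stack failure",
    },
    "nslookup resolves the desktop name": {"DNS server not working"},
    "Office 3 router pings Office 1 router": {
        "Router in Office 3 failed", "Router in Office 1 failed", "WAN link failed",
    },
}

remaining = remaining_suspects(suspects, list(clears), clears)
print(sorted(remaining))
```

Writing the mapping down forces you to state exactly what each test proves, which is the discipline that keeps inconclusive tests from sneaking in.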
This tried-and-true methodology becomes instinctive through experience, but for less experienced technical professionals, it can make the daunting task of network troubleshooting more approachable, methodical, and efficient.
Copyright 2003 Realtimepublishers.com, Inc.
This text is excerpted from the eBook Tips and Tricks Guide to Network Configuration Management, Chapter 3: Network Management Troubleshooting
To download/read the eBook in its entirety, visit: http://www.alterpoint.com/ebook