This tip is excerpted from "Network troubleshooting and diagnostics," Chapter 4 of The Shortcut Guide to Network Management for the Mid-Market, written by Greg Shields and published by Realtimepublishers.com. You can read the entire e-book for free at the link above.
When a problem occurs on your network, what do you usually do first? Do you ping the network device to see whether it is up and responding? Do you dive into the network closet to determine whether the LED lights are still blinking green or switched to red? Do you contact the user who called or log into the NMS that notified to find out more about the problem?
None of these initial triage maneuvers are any better than any other. For those who ping the device first, you immediately get a low-level understanding of whether that device can communicate with your workstation. For those that dive into the network closet, you can quickly find out if the interface is showing errors or is down. For those that contact the user or work with the notifying agent, you get an immediate first-hand look into the error that kicked up the notification.
In each situation, the most important part of troubleshooting is developing a good technique. No matter what the problem is on your network, you'll find that having a good technique for finding that problem helps you quickly identify where the root cause exists so that you can work towards a solution.
|It's worth stating that a good troubleshooter is a fast troubleshooter. Throughout this chapter, you'll notice that the troubleshooting processes, as well as the tools discussed, are designed to assist you with quickly coming to a resolution.|
OSI as a troubleshooting framework
One commonly used framework for troubleshooting that helps structure your response to a known network problem is the International Standards Organization (ISO) Open Systems Interconnect (OSI) model. If you've worked with networking devices for any period of time, you are likely already familiar with OSI. It's the framework that encapsulates much of modern networking, and most network protocols live somewhere within its seven layers. Where you may not have used it before is as a troubleshooting guide for triaging an unknown problem on the network.
Figure 4.1: The OSI model is an excellent mental framework to assist the troubleshooter with identifying network problems.
Without going into too much detail on the history and use of model, let's take a look at how you can extend the OSI model into a framework for problem isolation. Figure 4.1 shows the seven layers in the OSI model and some issues that typically occur related to each layer. Let's discuss each of the layers in-turn from the bottom-up:
- At the Physical layer, problems typically involve some break in the physical connectivity that makes up the network. Broken network connections, cabling and connector issues, and hardware problems that inhibit the movement of electricity from device to device typically indicate a problem at this layer.
- At the Data Link layer, we move away from purely electrical problems and into the configuration of the interface itself. Data Link problems often have to do with Address Resolution Protocol (ARP) problems in relating IP addresses to Media Access Control (MAC) addresses. These can be caused by speed and duplex mismatching between network devices or excessive hardware errors for the interface. An incorrectly configured interface within the device operating system (OS) or interference for wireless connections can also cause problems at the Data Link layer.
- At the Network layer, we begin experiencing problems with network traversal. Network layer problems typically occur when network packets cannot make their way from source to destination. This may have something to do with incorrect IP addressing or duplicate IP addresses on the network. Problems with routing data or ICMP packets across the network or protocol errors can also cause problems here. In extreme cases, an external attack can also spike error levels on network devices and cause problems identified at the Network Layer.
- At the Transport layer, we isolate problems that typically occur with TCP or UDP packets in Ethernet networks. These may have to do with excessive retransmission errors or packet fragmentation. Either of these problems can cause network performance to suffer or drop completely. Problems at this layer can be difficult to track down because unlike the lower layers they often don't involve a complete loss of connectivity. Additionally, Transport layer problems can often involve the blocking of traffic at the individual IP port layer. If you've ever been able to ping a server but cannot connect via a known port, this can be a Transport layer problem.
The Session, Presentation, and Application layers are often lumped together because more recent interpretations of the OSI model tend to grey the lines between these three layers. The troubleshooting process for these three layers involves problems that have to do with applications that rely on the network.
These applications could involve DNS, NetBIOS, or other resolution, application issues on residing OSs, or high-level protocol failures or misconfigurations. Examples of these high-level protocols are HTTP, SMTP, FTP, and other protocols that typically "use the network" rather than "run the network." Additionally, specialized external attacks such as "man-in-the-middle" attacks can occur at these levels.
Network problems can and do occur at any level in the model. And because the model is so highly understood by network administrators, it immediately becomes a good measuring stick to assist with communicating those problems between triaging administrators. If you've ever worked with another administrator who uses language like, "This looks like a Layer 4 problem," you can immediately understand the general area (the Transport layer) in which the problem may be occurring.
|You'll hear seasoned network administrators often refer to problems by their layer number. For example, when you hear "that's at layer 3," it can mean an IP connectivity problem. Layer 4 can reveal the problem is due to a network port closure. Network administrators jokingly refer to problems that occur with a system and not part of their network as those "at layer 7."|
Let's talk about three different ways in which you can progress through this model during a typical problem isolation activity.
Three different approaches
Network administrators who use OSI as a troubleshooting framework typically navigate the model in one of three ways: Bottom-Up, Top-Down, and Divide-and-Conquer. Depending on how the problem manifests and their experience level, they may choose one method over another for that particular problem. Each of these approaches has its utility based on the type of problem that is occurring. Let's look at each.
The Bottom-Up approach simply means that administrators start at the bottom of the OSI model and work their way up through the various levels as they strike off potential root causes that are not causing the problem. An administrator using the Bottom-Up approach will typically start by looking at the physical layer issues, determine whether a break in network connectivity has occurred, and then work up through network interface configurations and error rates, and continue through IP and TCP/UDP errors such as routing, fragmentation, and blocked ports before looking at the individual applications experiencing the problem.
This approach works best in situations in which the network is fully down or experiencing numerous low-level errors. It also works best when the problem is particularly complex. In complex problems, the faulting application often does not provide enough debugging data to the administrator to give insight as to the problem. Thus, a network-focused approach works best.
The Top-Down approach is the reverse of the Bottom-Up approach in that the administrator starts at the top of the OSI model first, looking at the faulted application and attempting to track down why that application is faulted. This model works best when the network is in a known-good state and a new application or application reconfiguration is being completed on the network. The administrator can start by ensuring the application is properly configured, then work downward to ensure that full IP connectivity and appropriate ports are open for proper functionality of the application. Once all upper-level issues are resolved, a back-check on the network can be done to validate its proper functionality. As said earlier, this approach is typically used when the network itself is believed to be functioning correctly but a new network application is being introduced or an existing one is being reconfigured or repurposed.
The Divide-and-Conquer approach is a fancy name for the "gut feeling" approach. It is typically used by seasoned administrators who have a good internal understanding of the network and the problems it can face. The Divide-and-Conquer approach involves an innate feeling for where the problem may occur, starting with that layer of the OSI model first, and working out from that location. This approach can also be used for trivial issues that the administrator has seen before.
However, this approach has the downfall of often being non-scientific enough to properly diagnose a difficult problem. If the problem is complex in nature, the Divide-and-Conquer approach may not be structured enough to track down the issue.
Figure 4.2: Depending on the type of problem, a Bottom-Up, Top-Down, or Divide-and-Conquer approach may be best for isolating the root cause of the problem.
No matter which approach you use, until you begin to develop that "gut instinct" for your network and its unique characteristics, you should consider a structured method for your troubleshooting technique. Although utilizing a structured method can increase the time needed to resolve the problem, it will track down the problem without missing key items that drive resolution "band-aiding."
|Until you develop a solid foundation in troubleshooting, be cautious with the Divide-and-Conquer approach. This approach often treats the symptoms of the network problem without actually fixing the root cause.|
Let's move away from the general concepts of troubleshooting and talk now about some of the tools available for helping move through the layers.
About the author:
Greg Shields is a Principal Consultant with 3t Systems in Denver, Colorado - www.3tsystems.com. With more than 10 years of experience in information technology, Greg has developed extensive experience in systems administration, engineering, and architecture specializing in Microsoft, Citrix, and VMware technologies. Greg is a Contributing Editor for both Redmond Magazine and Microsoft Certified Professional Magazine, authoring two regular columns along with numerous feature articles, webcasts, and white papers. He is known for his abilities to relate highly technical concepts with a drive towards fulfilling business needs. Greg is also a highly sought-after instructor and speaker, teaching system and network troubleshooting curriculum for TechMentor Events, a twice-annual IT conference, and producing computer-based training curriculum for CBT Nuggets on numerous topics. Greg is a triple Microsoft Certified Systems Engineer (MCSE) with security specialization and a Certified Citrix Enterprise Administrator (CCEA).