What do you think of when you hear the phrase "TCP/IP troubleshooting"? People who are visually imaginative may see a flowchart. More linear-minded types may see a series of numbered steps. Others (far too many) may simply feel a sense of inadequacy and frustration.
TCP/IP troubleshooting should be simple, right? After all, it's just a protocol -- a series of steps to transfer bits over the network. But what a protocol: four layers, with multiple protocols at each layer.
The traditional troubleshooting approach
Some years ago when I first learned about TCP/IP networking, I was taught a simple follow-these-steps approach to troubleshooting problems. The method went something like this:
- Type ipconfig to check if your IP address, subnet mask, and default gateway are correct.
- Now ping 127.0.0.1 (the loopback address) to see if the TCP/IP stack on your computer is working.
- Now ping your own computer's IP address.
- Now try pinging the IP address of another computer on the same subnet.
- Now try pinging your default gateway (the near-side interface of the router that connects your subnet to the rest of the network).
- Now try pinging the IP address of a computer on a different subnet.
- And so on.
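The ping sequence above can be sketched as a short script. This is a minimal illustration rather than a definitive tool: the function names (`ping_command`, `build_ping_plan`, `first_failure`) are hypothetical, and it assumes the system `ping` command takes `-n` (Windows) or `-c` (most Unix-like systems) for the number of echo requests.

```python
import platform
import subprocess


def ping_command(target, count=2):
    """Build a ping command line; -n is the count flag on Windows, -c elsewhere."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), target]


def ping(target):
    """Return True if the system ping command exits with status 0.

    The exit status is only an approximation of reachability; some
    platforms report success even when replies are error messages.
    """
    result = subprocess.run(ping_command(target), capture_output=True)
    return result.returncode == 0


def build_ping_plan(own_ip, neighbor_ip, gateway_ip, remote_ip):
    """Order the targets from nearest to farthest, as in the steps above."""
    return [
        ("loopback", "127.0.0.1"),
        ("own address", own_ip),
        ("same-subnet neighbor", neighbor_ip),
        ("default gateway", gateway_ip),
        ("remote subnet", remote_ip),
    ]


def first_failure(plan, probe=ping):
    """Walk the plan in order; return the label of the first failing step, or None."""
    for label, target in plan:
        if not probe(target):
            return label
    return None
```

Calling `first_failure(build_ping_plan(...))` with your own addresses returns the label of the nearest step that failed, which is exactly the information the step-by-step method is designed to produce. The `probe` parameter exists only so the walk can be exercised without touching the network.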
I call this the "brain-dead approach" because it's so methodical you can basically turn off your brain and just follow the steps. It's also somewhat inefficient, for it automatically assumes that your problem most likely starts with your own computer and that the problem is more likely to be closer to you (your network card, your computer's IP address configuration, your local subnet) than further away (other subnets). And it's a method that was probably developed before the Internet really took off -- that is, before DNS became ubiquitous for name resolution and before firewalls and VPNs became a fact of life for most corporate networks.
What I mean is this: one of your users says "I can't connect to the server right now." What could be the problem? It helps to dissect this simple sentence to understand the issues that may be involved. For example:
Is this the only user who has called in reporting network problems? If there are others, do they have similar issues? If so, then right away it's clear you don't need to take a brain-dead approach and begin your troubleshooting at the user's computer. Instead, the issue is most likely "out there" somewhere: maybe your DNS server is offline or your DNS provider is experiencing difficulty. Or maybe a router on your internal network is misbehaving and dropping packets. Or maybe the server your users are trying to connect to has crashed.
You should also stop and think about any commonalities these users may have. For example, are their machines all on the same subnet? If so, then maybe the default gateway for that subnet is misconfigured or the router crashed. Or maybe a contractor working in the plenum crawlspace has accidentally cut a network cable connecting the subnet's workgroup switch to the department's main Ethernet backbone switch. Or maybe someone malicious has installed a rogue DHCP server on that subnet and it's stealing machines as their leases come up for renewal and assigning them unroutable addresses to create a denial of service condition.
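One quick way to look for the subnet commonality described above is to group the affected machines by network. The following is a rough sketch using Python's standard `ipaddress` module; the function names and the sample addresses are assumptions made for illustration, and it presumes you know each client's address and prefix length.

```python
import ipaddress
from collections import Counter


def affected_subnets(client_interfaces):
    """Count affected clients per subnet.

    Each entry is an address with prefix length, e.g. "192.168.1.23/24".
    """
    counts = Counter()
    for text in client_interfaces:
        network = ipaddress.ip_interface(text).network
        counts[str(network)] += 1
    return counts


def shared_subnet(client_interfaces):
    """Return the subnet common to every affected client, or None if they differ."""
    counts = affected_subnets(client_interfaces)
    if len(counts) == 1:
        return next(iter(counts))
    return None
```

If `shared_subnet` returns a network, the default gateway, the workgroup switch and the DHCP server for that subnet are reasonable first suspects. If it returns `None`, the common factor is probably further away.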
If it's only that one user who has the problem, though, then it's probably time to play brain-dead and start asking questions like "OK, is your computer turned on? Is the network cable securely attached at the back of your machine?" and so on.
A good question to ask this user is "What do you mean by connect?" That's because "connect" is a technical-sounding word that users often use to impress the Help Desk and show they know what they're talking about. Well, they usually don't. Why? Because there are different kinds of connectivity including MAC-level communications, TCP sessions, password authentication, access rights and privileges, NAT-traversal connectivity, firewall pass-through, application-level sessions, and so on. What kind of connectivity problem are they actually having? What are they trying to do when they say they want to "connect to" the server? Are they trying to access a share on that server? Do they get an "Access denied" message when they do this? Are they getting a login box prompting them for credentials? Is it rejecting their credentials? Are they having trouble finding the share in Active Directory? Is it a mapped drive they are having problems with? Are they trying to browse to find the server in My Network Places? And so on.
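To make the distinction between kinds of connectivity concrete, here is a small sketch that separates two of them: name resolution and TCP session setup. It is only an illustration built on Python's standard `socket` module. The function name `diagnose_connect` and the outcome strings are invented for this example, and a successful result says nothing about authentication, access rights or application-level sessions.

```python
import socket


def diagnose_connect(host, port, timeout=3.0):
    """Classify a connection attempt as one of a few coarse outcomes."""
    # Step 1: can the name be turned into one or more addresses?
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name resolution failed"

    # Step 2: can a TCP session be opened to any of those addresses?
    last = "unreachable"
    for family, socktype, proto, _canon, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return "tcp connection established"
        except ConnectionRefusedError:
            last = "connection refused"
        except socket.timeout:
            last = "timed out"
        except OSError:
            last = "unreachable"
    return last
```

A "connection refused" result usually means the host answered but nothing is listening on that port, while "timed out" more often points at a firewall or a routing problem along the way. Neither is conclusive on its own, but each narrows down what "can't connect" actually means.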
And is it just that server they're having trouble connecting to, or are they having problems connecting to anything on the network? Determining the scope of the problem here is important: Is connectivity failing in just one way or many ways?
You've got this user over here, and this server over there, and the network between. They can't connect. Why? Well, where exactly is that server anyway? Is it on the user's subnet? On an adjacent subnet? In a different department? On a different floor? In a different building? On a different continent? What kind of network connects the user with that particular server? A wired Ethernet LAN? A wireless LAN (WLAN)? A fractional T1 line? Frame Relay? A VPN tunnel over the Internet? A dial-up modem connection? Cable modem or DSL?
First determine the type of connection (possibly several types) between the user and the server, and then ponder where things might break down. Maybe the CSU/DSU has gone wonky; try power-cycling it or contact your service provider, who should be monitoring it. Maybe the janitor was cleaning the server room and bumped a power bar, and an Ethernet switch has gone offline. Check for an alert message from your network management software, assuming you're using managed switches. Maybe there's been a power blackout at the remote branch office where that server is located. Call them on the phone and see what's happening.
And is it server or servers? Is the user having trouble connecting to only that server or to other servers as well? Are others having problems connecting to other servers also? What are the commonalities (if any) between all the servers being affected? (Or apparently being affected -- remember, the problem may be with the users' computers or more likely with the network infrastructure itself.)
The time element is crucial in troubleshooting. Did the problem just start happening? When was the last time you successfully connected to the server? How long has it been going on for? Is it continuous or intermittent? Intermittent network problems involving unreliable WAN links and other issues can be difficult to troubleshoot, especially if they're transient, i.e. brief and occasional.
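If you can collect periodic probe results (for example, one ping every minute), the continuous-versus-intermittent question can be answered mechanically. The sketch below assumes a simple list of booleans, oldest first, where `True` means the probe succeeded; the function name and the labels it returns are illustrative only.

```python
def classify_outage(samples):
    """Classify probe results (True = success) taken at regular intervals."""
    if not samples:
        return "no data"
    failures = samples.count(False)
    if failures == 0:
        return "healthy"
    if failures == len(samples):
        return "continuous"
    # If every sample from the first failure onward also failed, the
    # problem started at that point and has persisted ever since.
    first_bad = samples.index(False)
    if all(not ok for ok in samples[first_bad:]):
        return "continuous since sample %d" % first_bad
    return "intermittent"
```

The index reported in the "continuous since" case gives you an approximate start time, which is useful when you go looking for other things that happened on the network around then.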
Time can also help you relate the problem to other circumstances that might be impacting your network. Did the problem start this morning at 10 am? What else happened on your network around then? Were patches applied by a WSUS server? Did scheduled maintenance on a domain controller occur? Was a construction crew in the building compound using a backhoe to repair a water main break?
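Relating the start of a problem to other events is easy to sketch if you keep even a rough change log. The example below is a minimal illustration: the function name `events_near`, the 30-minute default window and the shape of the log (pairs of name and timestamp) are all assumptions, not features of any particular tool.

```python
from datetime import datetime, timedelta


def events_near(problem_start, events, window=timedelta(minutes=30)):
    """Return names of events within the window of the problem start, nearest first.

    `events` is an iterable of (name, datetime) pairs.
    """
    nearby = [
        (abs(when - problem_start), name)
        for name, when in events
        if abs(when - problem_start) <= window
    ]
    return [name for _gap, name in sorted(nearby)]
```

Events that fall just before the problem began are the usual suspects, but the window is deliberately two-sided: the time a user reports is often later than the time the problem actually started.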
A structured approach
My own approach to TCP/IP troubleshooting is structured around three critical areas:
- Determining the elements of the problem. This means:
  - Client end: The client(s) experiencing the difficulty (the user end).
  - Server end: The server(s), printer(s), or other network resources (such as the Internet) that the clients are experiencing difficulty with.
  - Network in between: The wires (if not wireless), hubs, switches, routers, firewalls, proxy servers and any other network infrastructure between the client end and the server end.
  - Environment: External circumstances that may be affecting your network like power fluctuations, building maintenance and so on.
  - Scope: One or many clients/servers involved.
  - Time frame: Continuous, intermittent, occasional; when did it begin; and so on.
  - Type of connection problem: Physical, network, transport or application layer; authentication or access control; and so on.
  - Signposts: Error messages on client machines; login boxes; and so on.
- Determining which troubleshooting steps might apply given the above problem elements. This includes:
  - Verifying physical media connectivity for the client(s), server(s) and network infrastructure hardware involved. This means checking cables, making sure network adapters are properly seated, and looking for other causes of network connections displaying a media disconnected state.
  - Verifying TCP/IP configuration of the client(s), server(s) and network infrastructure hardware involved. On the clients and servers this means IP address, subnet mask, default gateway, DNS settings and so on. For network infrastructure hardware, this typically means routing tables on routers and Internet gateways.
  - Verifying routing connectivity between the client(s) and server(s) involved. This means using ping, pathping, tracert and other similar tools to verify end-to-end TCP/IP connectivity at the network level; packet sniffing to monitor transport layer sessions; and using nslookup, telnet and other tools to troubleshoot application layer issues involving name resolution problems, authentication problems and so on.
- Understand it, question it, test it.
  - Understanding how protocols work, how packets are forwarded according to routing tables, and what tools like Netdiag.exe can tell you is critical. Successful TCP/IP troubleshooting is founded upon a good understanding of how TCP/IP works and the tools that can be used to test it. If you've never plodded through trying to understand a Network Monitor trace, you'll have difficulties troubleshooting certain kinds of problems.
  - Asking the right questions is also critical to good troubleshooting. Learning when to be methodical and when to take a mental leap is the essence of the art of troubleshooting, and it involves full use of both your left brain (logic) and right brain (intuition).
  - Finally, getting your hands dirty and actually testing things to try and isolate the problem is critical, and to do this you need a toolbox of troubleshooting tools you know how to use. There's nothing like lots of experience to help you solve a difficult problem, even if it's something you've never seen before.
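As one example of the "verifying TCP/IP configuration" step in the second area above, the sketch below sanity-checks a host's IPv4 address, subnet mask and default gateway against each other. It uses Python's standard `ipaddress` module; the function name `check_ip_config` and the particular checks chosen are assumptions for illustration, and it covers only a few of the mistakes that can occur in practice.

```python
import ipaddress


def check_ip_config(address, netmask, gateway):
    """Return a list of problems found in a host's basic IPv4 settings."""
    try:
        interface = ipaddress.ip_interface("%s/%s" % (address, netmask))
    except ValueError as exc:
        return ["invalid address or subnet mask: %s" % exc]
    try:
        gw = ipaddress.ip_address(gateway)
    except ValueError as exc:
        return ["invalid default gateway: %s" % exc]

    problems = []
    network = interface.network
    # The gateway must be directly reachable, i.e. on the host's own subnet.
    if gw not in network:
        problems.append("default gateway %s is outside %s" % (gw, network))
    if gw == interface.ip:
        problems.append("default gateway equals the host address")
    # Skip this check for /31 and /32, where these addresses can be valid hosts.
    if network.prefixlen < 31 and interface.ip in (
        network.network_address,
        network.broadcast_address,
    ):
        problems.append("host address is the network or broadcast address")
    # 169.254.0.0/16 usually means automatic addressing kicked in.
    if interface.ip.is_link_local:
        problems.append("host has a link-local address (DHCP may have failed)")
    return problems
```

An empty list means only that these three settings are consistent with one another, not that they are the right values for your network. DNS settings and routing tables still need to be checked separately.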
Troubleshooting TCP/IP networks can be frustrating, but it can also be fun. In future articles we'll zoom in on the troubleshooting steps and tools you need to be able to use in order to successfully solve the issues that might arise on your network. Until then, stay connected!
About the author:
Mitch Tulloch is a writer, trainer and consultant specializing in Windows server operating systems, IIS administration, network troubleshooting, and security. He is the author of 15 books including the Microsoft Encyclopedia of Networking (Microsoft Press), the Microsoft Encyclopedia of Security (Microsoft Press), Windows Server Hacks (O'Reilly), Windows Server 2003 in a Nutshell (O'Reilly), Windows 2000 Administration in a Nutshell (O'Reilly), and IIS 6 Administration (Osborne/McGraw-Hill). Mitch is based in Winnipeg, Canada, and you can find more information about his books at his Web site: www.mtit.com.