Troubleshooting network problems is hard. Packet loss, oversubscription, security patches and software version control continue to give network engineers nightmares. But many IT pros are finding that good network design and governance can take away some of those bad dreams.
Patrick Miller, manager of architecture and desktop services at Sparks, Md.-based Apex Tool Group LLC, remembers trying to track down a recurring problem on a Token Ring network many years ago that could easily recur in any enterprise network today.
"I had a situation where every night at 10 p.m. our UPS [uninterruptible power supply]systems out in the plant would go down and no one could explain why," Miller said. "So I went out there with sniffers, laptops doing pings and traceroutes. At the end of the day, I visually followed the cable."
Miller found a powered Controlled Access Unit switch that was unplugged. A cleaning lady was unplugging it every night so she could power her vacuum, and in the process brought down the network. "Weird things like that happen," he said. "Sometimes you can send $10,000 worth of equipment out there trying to resolve the problem. Sometimes you can just track the cable. Things that show up as packet loss can be a completely different issue altogether. Packet loss is one of those weird things where sometimes there is just no magic bullet."
Magic bullets for troubleshooting network problems are hard to find, but network engineers would like to reduce the amount of time they spend on these issues. Unfortunately, a lot of companies have a long way to go. MyITassessment.com, a Software-as-a-Service-based provider of infrastructure assessments that helps large system integrators assess customer networks, has assembled some telling stats from the scans it has performed on more than 2,000 enterprise networks.
- Layer 3 devices drop packets in 63% of enterprise networks.
- Oversubscribed switches cause performance problems in 35% of networks.
- Unpatched security vulnerabilities show up on switches and routers in 44% of enterprises.
- More than 75% of enterprises have inconsistent versions of IOS on devices within the same product family.
- Switches and routers have aged out of vendor support in 54% of networks.
Although these problems persist, many network engineers are finding ways to combat them.
Tackling packet loss and oversubscription
Engineers will never eliminate all packet loss from their networks, but tight monitoring and better network design can help. Forrest Schroth, network manager at global staffing firm Randstad, oversees 300 sites on his Multiprotocol Label Switching, or MPLS, cloud, and he closely monitors four metrics to guard against packet loss.
"I'm always looking for jitter, misassembly of packets, either from a telco or a bad interface card on our side. I'm making sure utilization doesn't breach certain thresholds, and I guess you could add latency to that," Schroth said. "When I come in in the morning, I have a chart that shows me all my sites and who is taking the most errors, the most jitter, who's taking the most utilization. And we do traffic-shaping with that. If there are errors on the line, we call whatever carrier we're interfacing with, and between our provider edge and the stuff on our customer edge, we figure out where the errors are coming from. That's a daily task for an engineer."
Packet loss, however, is harder to trace on the LAN, said Rich Siedzik, director of computer and telecommunications services at Bryant University in Providence, R.I.
"Usually for us, you spot [packet loss] after the fact [when] you start to see degradation in service or a user complaint comes in. Then you go looking for it. It's tough to do because there are so many segments and so many different paths," Siedzik said. "To run something on every single [path] would be almost impossible. We prioritize visibility when we go between different segments on the network, like a segment going from the core to the distribution layer. Then, as you get out to the access layer, there is less monitoring because there are more points to monitor."
Quite often, packet loss results from a bad cable or port. Sometimes, it's a bad design. The biggest design mistake network engineers make is focusing on bandwidth rather than on a switch's ability to process packets, Randstad's Schroth said. "Just because it's a gigabit interface doesn't mean it's going to accept all traffic," he said. "I'm more interested in the rate of speed a piece of equipment can accept traffic, the packets per second ratio. I see a lot of people getting into 10 Gigabit. That's great, but you need to make sure that equipment is line rate."
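Schroth's point about line rate comes down to arithmetic. On Ethernet, every frame also occupies 20 bytes of preamble, start-of-frame delimiter and interframe gap on the wire, so a 1 Gbps port has to handle roughly 1.49 million minimum-size (64-byte) frames per second to keep up. The sketch below is illustrative only, and the function names are invented for this example.

```python
# Per-frame overhead on the wire: 7-byte preamble + 1-byte start-of-frame
# delimiter + 12-byte interframe gap.
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps, frame_bytes=64):
    """Packets per second needed to fill a link with frames of one size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def is_line_rate(device_pps, link_bps, frame_bytes=64):
    """True if a device's forwarding rate can saturate the link."""
    return device_pps >= line_rate_pps(link_bps, frame_bytes)


if __name__ == "__main__":
    for label, bps in (("1 Gbps", 1e9), ("10 Gbps", 10e9)):
        print(f"{label}: {line_rate_pps(bps):,.0f} pps at 64-byte frames")
```

A device rated below these figures will drop small packets under load even though its interface is nominally gigabit or 10 Gigabit.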
Jeremy Littlejohn, CEO and chief analyst at myITassessment.com, agrees. Too many engineers throw bandwidth at problems instead of getting to the root cause. "Somehow bandwidth got to be the lead metric on everything, and it's not good," he said. "Engineers should focus on packet loss and look at whether [a lack of bandwidth or something else] is actually causing packet loss."
Oversubscribed switches and routers also continue to bottleneck networks. Sometimes oversubscription stems from poor management of individual devices: some enterprises lose track of the backplane capacity on modular switches and routers and install more line-card bandwidth than the backplane can carry.
"Eight ports may all be sharing an ASIC [application-specific integrated circuit] backplane that is oversubscribed, and that is a silent killer when we add all these virtual machines," Littlejohn said. "We plug them into these groups of ports that we think we're getting 8 gigabits out of, when we're only getting 1 gigabit."
Even if you do carefully avoid oversubscription on devices, logjams will persist.
"While I might add bandwidth here, that just means that area isn't going to be bandwidth-constrained, but the bottleneck always moves," Randstad's Schroth said. "The question is, will it be an application? Will it be a different WAN link? Will it be a switch port link? At some point there is always going to be a slowest point of a connection." Given their inevitability, a network engineer needs to predict where that next bottleneck will happen and make sure that point is well monitored, he said. "At some point, any connection is going to have a slowest link, and keeping an eye on that slowest link is the name of the game."
Tracking OS versions, security patches, and age of switches and routers
Other challenges highlighted by myITassessment.com -- OS versions, security patches and age of equipment -- point to issues with overall governance of assets. These problems can affect an enterprise's ability to scale and automate its network. "It affects how they can scale support if they can't have some standardization about what's out there," Littlejohn said. "And doing anything that's automated, as soon as you introduce a variable like different OS versions, you'll not be able to execute that effectively."
Siedzik said he's been able to improve Bryant University's approach to tracking these issues through the use of better tools, including the Cisco Network Collector (CNC), an asset tracking appliance typically used by Cisco value-added resellers. "It shows us all our code levels. It shows us where we're at with vulnerabilities, what we need to patch and what we need to upgrade," he said. "We have all that in report formats. Before that, it was best effort."
Before installing CNC, Bryant's network administrators often discovered a switch had reached end-of-life when they called the Cisco Technical Assistance Center for support on the device. "That was the old way. We can't operate in that fashion anymore," Siedzik said.
The university has also used CNC to standardize the IOS versions on its devices. "Before, we had a lot of diverse code revs," Siedzik said. "We had a lot of closets all around campus, and we'd find in this stack we had IOS levels of this, and in another closet it was something else. It made us vulnerable and put us at risk where if we were going to do an upgrade on a particular code, we didn't know if it would have any ramifications on things upstream or downstream. Now we do these major code updates twice a year."
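Spotting the kind of version drift Siedzik describes is straightforward once an inventory exists. The following is a minimal sketch that groups devices by product family and flags any family running more than one OS version. It does not represent CNC's output or any Cisco API; the inventory records, field names and version labels are all hypothetical.

```python
from collections import defaultdict


def inconsistent_families(inventory):
    """Map each product family running more than one OS version to its versions."""
    versions = defaultdict(set)
    for device in inventory:
        versions[device["family"]].add(device["os_version"])
    return {
        family: sorted(found)
        for family, found in versions.items()
        if len(found) > 1
    }


# Hypothetical inventory; in practice this would be exported from an
# asset-tracking tool.
INVENTORY = [
    {"host": "closet-1-sw1", "family": "access-switch", "os_version": "rel-1"},
    {"host": "closet-1-sw2", "family": "access-switch", "os_version": "rel-1"},
    {"host": "closet-2-sw1", "family": "access-switch", "os_version": "rel-2"},
    {"host": "core-rtr1", "family": "core-router", "os_version": "rel-5"},
    {"host": "core-rtr2", "family": "core-router", "os_version": "rel-5"},
]

if __name__ == "__main__":
    for family, found in inconsistent_families(INVENTORY).items():
        print(f"{family}: {', '.join(found)}")
```

Here only the access switches are flagged, because one closet is running a different release from the others.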
The Randstad staffing firm commissioned a major greenfield network install six years ago, which made it easy for Schroth to impose strict version control in his network design, a policy he enforces with Cisco Prime management software. "I know if someone has made a change to a configuration or changed an OS," he said. "And very often we'll run inventory reports to make sure everything is running on the proper versions. Usually I do it around hardware reversioning, too. We went from the 2600 to the 2800 series Cisco router. When we replaced our equipment, we picked the OS that would work across our enterprise, and we templated that. When we do that again, we'll pick new software and try to find one that is compatible and stable enough to put on a template, so, based on the size and function of an office, we have a certain configuration we use that would determine what OS and what hardware would run there."
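Schroth's templating approach amounts to a lookup from an office's size and function to an approved hardware and OS pairing, plus a check for sites that have drifted from it. The sketch below assumes a simple in-memory table; the size and function labels, model names and release labels are placeholders, not Randstad's actual templates or Cisco Prime functionality.

```python
from typing import NamedTuple


class Template(NamedTuple):
    hardware: str
    os_version: str


# Hypothetical templates keyed by (office size, office function).
TEMPLATES = {
    ("small", "branch"): Template("router-model-s", "rel-1"),
    ("large", "branch"): Template("router-model-l", "rel-1"),
    ("large", "datacenter"): Template("router-model-x", "rel-2"),
}


def template_for(size, function):
    """Look up the approved hardware and OS for an office profile."""
    return TEMPLATES[(size, function)]


def drift(office, template):
    """List the template fields on which an office's device deviates."""
    return [
        field
        for field in Template._fields
        if office[field] != getattr(template, field)
    ]


if __name__ == "__main__":
    office = {"hardware": "router-model-s", "os_version": "rel-0"}
    print(drift(office, template_for("small", "branch")))  # ['os_version']
```

An empty list means the site matches its template; any other result names what an inventory report would need to flag.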
Good governance doesn't automatically mean you can standardize on one codebase, because switch and router software remains idiosyncratic. "We have a lot of voice in our network, so we know depending on the model of router, that a certain version of IOS works really good for voice," said Apex Tool Group's Miller. "But if I have another router that I need to turn BGP [Border Gateway Protocol] on, I know there may be a vulnerability in that IOS version. So, I need to go to a later version. It's often not the case where you want to be on the most up-to-date version of IOS, because a lot of times there are bugs in there that people don't know about yet." Miller tracks these different code releases on his infrastructure through detailed documentation, but he'd like to see better tools from Cisco, especially given the complexity of managing different OS versions for different applications.
Engineers will always face the delicate job of balancing risk and stability, Miller said. While some network managers would like the simplicity of having uniform code with the latest security patches, that aspiration just isn't workable in production networks. "It comes down to what makes the most sense for my business," he said. "Are my switches vulnerable, or is stability the most important thing? It depends on where that switch lives."
Let us know what you think about the story; email: Shamus McGillicuddy, News Director.