"It's reduced the amount of time we have to work on our break-fixes," said Greg Crosby, network services manager for the Seattle Times. "Our biggest challenge is when there is a break-fix scenario and we don't have a smoking gun. We need to do a discovery to figure out what the problem is. It always varies, but when we did some sort of packet capture and analysis, we're talking about a day's work at the very least."
With a team of only six people handling network services for the newspaper's Seattle headquarters, its print facility and more than a dozen remote offices, Crosby needed a tool that could lower his mean time to repair.
The network services team used to track down network problems by plugging laptops with protocol sniffers into ports on individual network devices to analyze packet traffic. The protocol sniffer would wait for a problem to occur, then look at the packets coming in and out of the device to see if that device was causing the problem. If it wasn't, Crosby's team would move the laptop to the next port on the network and wait for the problem to recur.
"We would do individual captures and bring them all together and look at the captures side by side, which is difficult when you don't have a system that's already aggregating all the data for you and seeing what the potential problems are," he said.
Comparing those individual captures side by side was also imperfect, because each one was taken during a different occurrence of the failure. The Times had no way to capture packets on all devices at once.
About eight months ago, Crosby purchased the OmniPeek packet analysis tool from WildPackets. He liked its ability to monitor multiple devices simultaneously, which lets his team perform trend analysis across the network and see how packets look as they flow through each device along their path.
In addition to OmniPeek, Crosby's team uses WhatsUp Gold from Ipswitch for network alerts and utilization metrics. The team also uses CyberGauge to measure bandwidth utilization and Cisco Security Monitoring, Analysis and Response System (MARS) for event correlation.
However, the event correlation offered by MARS couldn't speed up Crosby's break-fix process, because MARS looks only at network device logs. Crosby wanted OmniPeek for its ability to examine the packets themselves.
"Logs will just be what [network] devices see as a problem, but when you look at packets you can see what hosts and servers see as a problem," he said.
Crosby said a recent problem with the company's FTP server illustrated how much OmniPeek has improved his team's ability to track down network problems. File transfers between the server and a host were failing at seemingly random intervals.
"[FTP] just had this intermittent problem," he said. "It wasn't like every five minutes it occurred, or every half hour. It was just sort of random."
Crosby used OmniPeek to examine packet traffic among the FTP server, the host and several network devices between them.
"We were looking at packets entering and leaving this particular device and four other devices as well," he said. "We could watch the trace for when the problem occurred and say, OK, through the router the packet looks clean. Through the firewall the packet looks clean. Then it would come through the load balancer and the packet just wouldn't show up."
The culprit turned out to be the load balancer, which was timing out FTP traffic at peak usage because of a configuration problem: it was set to time out sessions for various protocols after a certain amount of time had passed. Crosby said his networking team simply adjusted the FTP timeout period on the load balancer, and the problem was solved.
Solving the FTP problem took Crosby's team about a week. "Maybe that sounds like a long period, but when you have a problem that doesn't crop up on a system on a routine basis, you kind of have to trend it for a little bit. A week is quite short for trending," he said. "If we were using laptops, that would have been a mess. We would have had to put six laptops on there and we would have had to centralize our captures and analyze each capture independently."
Crosby said OmniPeek's ability to streamline troubleshooting has freed up his staff to work on other essential tasks.
"We have a lot of maintenance cycles we have to do, that maybe we were prolonging a bit because we were dealing with the break-fixes. Or we had infrastructure build-outs that we were working on that we had to delay because we were doing break-fixes. Once you free yourselves of them you can continue with other tasks," Crosby said.
And by devoting more time to maintenance, Crosby's team should be able to reduce the number of network problems that occur.
"If you're dealing with break-fixes you can't do maintenance cycles. And maintenance cycles help to reduce break-fixes," he said.
Let us know what you think about the story; email: Shamus McGillicuddy, News Editor