Last month, a LAN outage at LAX shut down the airport, stranding thousands of passengers for up to nine hours. Shortly after, a massive Skype outage was caused by a software bug triggered when a large number of customers downloaded software updates and then rebooted their computers. Then, AT&T's EDGE network for wireless data had an outage.
These are just some recent examples of network outages and the inconvenience they can cause, especially if they're not attended to in a timely manner. They also highlight the need for systems that can diagnose network problems and cut down the mean time to repair (MTTR).
A recent study conducted by application and network performance management vendor NetScout Systems Inc. found that many companies are taking too long to diagnose and fix network performance problems. The study, which polled 284 users, found that 75.9% of IT managers typically hear about performance issues from irritated end users who call the help desk.
"We're still finding a lot of problems that come to light when users complain," said Eileen Haggerty, NetScout's director of solutions marketing. "We need to get ahead of these problems."
Haggerty said companies are failing to get ahead of performance issues because they lack IT resources: network spending is increasing, but the number of people managing the network and applications is staying level.
"You're getting more complexity and [fewer] people," she said. "They're going to have to add more visibility [into network performance problems] and more automation of problem detection."
Adding to the problem is the apparent lack of communication between application and networking groups, Haggerty said. The study found that more than 25% of network managers view the relationship between the applications and network groups within their organizations as "slightly or moderately adversarial." She said companies need performance management solutions that bring various IT groups together to discuss performance issues, instead of pointing fingers when a problem occurs.
"These problems can affect business and revenue," she said, adding that respondents are "ending up having to prove it was not the network" causing the problem. In-fighting, she said, can lead to more problems and time wasted trying to solve them.
Fixing network problems faster is also top of mind for upper management, according to the study's findings. Respondents said that upper management expects them to resolve network and application-related issues more swiftly. Twenty-three percent of respondents have a management-by-objectives (MBO) goal tied to MTTR. From January 2006 to earlier this year, that number increased by a whopping 272%.
The increase in MBO rates illustrates that more network managers are being asked to formally report on their ability to address performance problems, Haggerty said, so they need reliable reporting functions.
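To make the metric behind those reports concrete, here is a minimal sketch of how MTTR might be computed: the total time spent repairing incidents divided by the number of incidents. The incident timestamps and the function name are hypothetical, chosen only for illustration; they are not drawn from the study or from any NetScout product.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, resolved) timestamps.
INCIDENTS = [
    (datetime(2007, 9, 3, 9, 0), datetime(2007, 9, 3, 11, 30)),     # 2.5 hours
    (datetime(2007, 9, 10, 14, 0), datetime(2007, 9, 10, 14, 45)),  # 0.75 hours
    (datetime(2007, 9, 17, 8, 15), datetime(2007, 9, 17, 16, 15)),  # 8 hours
]


def mean_time_to_repair(incidents):
    """Return MTTR as a timedelta: total repair time / number of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum yields a timedelta
    )
    return total / len(incidents)


mttr = mean_time_to_repair(INCIDENTS)
print(f"MTTR: {mttr.total_seconds() / 3600:.2f} hours")  # MTTR: 3.75 hours
```

Measuring from the moment of detection rather than from the first user complaint is a choice made for this sketch; a team that still learns of problems from the help desk would see a longer real-world repair time than this figure suggests.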
The study also revealed that more companies are implementing some level of quality of service (QoS) to ensure that applications are running correctly across the network, in hopes of avoiding performance problems. Eighty-three percent of organizations polled reported that they have either implemented QoS policies or plan to do so over the next year. More than one third, 36.4%, said the primary reason for implementing QoS was to get a handle on VoIP traffic, while another 31.8% said they use QoS as part of a WAN services offering or to support VoIP over an MPLS WAN.
Gary Abbott, network capacity planner and technology adviser with InterContinental Hotels Group (IHG), said that his company at one time relied on user complaints to determine when the network was suffering performance problems.
"We spent many years doing it the good old-fashioned way, which is waiting for the phone to ring," Abbott said. "The process was very time consuming."
Abbott said he considered upping the bandwidth, but "if you have machines that are acting abnormally, you can't fix it by throwing bandwidth at it."
Now that he uses NetScout's nGenius suite of monitoring tools, Abbott no longer worries that the phone will ring with complaints about network slowdowns. When new applications are being rolled out, he said, he watches to see what the traffic will do, which allows him to plan for capacity and ensure a successful rollout.
Abbott recalled a recent incident where certain servers would blast 160 Mbps of data at one another once an hour. The massive blast of data would slow network and application performance to a crawl. In seconds, he was able to spot the issue and look at the main network arteries to pinpoint the problem. In the past, it could have taken an entire workday to trace the source of the trouble.
Now, Abbott said, IHG has a number of large screens that he watches to keep tabs on his most important network arteries. After watching them for a short time, he said, "you have a feel for what normal is" and can detect problems as they pop up, or just before.
"I added baselines so I can see normal," he said. "In that case I saw the spikes showing up and found who the talkers were -- the two servers blasting data to each other. I scheduled a restart of the boxes and the spikes went bye-bye. It took a couple of calls and sending out a chart, but the old way would've taken forever. I don't think I spent an hour on the entire thing."
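The two steps Abbott describes, comparing traffic against a baseline and then finding the top talkers, can be illustrated with a short sketch. This is not how nGenius works internally; the function names, the three-standard-deviation threshold, the host names and every sample value except the 160 Mbps spike are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical baseline of "normal" link utilization, in Mbps.
BASELINE = [20, 22, 19, 21, 23, 20, 18, 22]
# Hypothetical new samples; 160 Mbps mirrors the spike in the article.
SAMPLES = [21, 19, 160, 22, 20]
# Hypothetical flow records: (source, destination, bytes transferred).
FLOWS = [
    ("srv-a", "srv-b", 9_000_000),
    ("srv-b", "srv-a", 8_500_000),
    ("pc-17", "srv-a", 120_000),
    ("pc-22", "mail", 80_000),
]


def find_spikes(samples, baseline, threshold_sigmas=3.0):
    """Return indexes of samples that exceed the baseline mean by more
    than threshold_sigmas sample standard deviations."""
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return [i for i, value in enumerate(samples) if value > limit]


def top_talkers(flows, n=2):
    """Return the n host pairs that exchanged the most bytes,
    counting traffic in both directions together."""
    totals = Counter()
    for src, dst, nbytes in flows:
        pair = tuple(sorted((src, dst)))  # direction-agnostic key
        totals[pair] += nbytes
    return totals.most_common(n)


print(find_spikes(SAMPLES, BASELINE))  # [2]
print(top_talkers(FLOWS, n=1))         # [(('srv-a', 'srv-b'), 17500000)]
```

A fixed multiple of the standard deviation is only one simple way to define "normal"; traffic with strong daily or weekly cycles would need a baseline per time window.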
Moving forward, Abbott has put together a methodology to review reports and discover problems. He advises others to do the same.
"You must look at reports to find actionable items," Abbott said. "You have to figure out ways to minimize the time it takes to solve a problem and make that a part of your day. You have to catch things. You have to predict the future, defeat space and time, and change the speed of light."