Network Evolution

Building the infrastructure for the changing face of IT

Man vs. machine: In a network outage, who's to blame?

Downtime is often chalked up to human error. But several high-profile outages this year were blamed on faulty technology. Which is more dangerous to your network?

Every day, 25,000 people walk into one of 150 medical offices around the United States and donate their blood plasma to Grifols, a Spanish pharmaceutical and chemical company that turns the plasma into protein therapies to treat life-threatening illnesses.

Plasma donors, who are paid, can donate as often as twice a week. The process can take 90 minutes to two hours. Repeat donors are an important part of the business.

"We need to make their experience as best as possible and reduce the time they are there," says Josep Sans, global network manager at Grifols. "When the donor is in the center, we need to make sure we treat him well and get the plasma in exchange. This is our raw material to build our products."

Those donor centers are the company's most critical points on the network. That's why Sans and his team have begun implementing application visibility, application control and WAN optimization tools from Ipanema Technologies -- acquired by InfoVista earlier this year -- to avoid a network outage, accelerate network traffic and guarantee the availability of its Web applications. 

"If we lose communication with a center, then the system is broken," Sans says. "You cannot tell the donor, 'Please come back an hour later.' They will not come back. This is why our network needs to be online."

Although most companies don't lose blood during a network outage, such disruptions still hurt.

The cost of network downtime stemming from an IT failure can be as high as $100 million a year for a large business, according to a survey of 205 medium and large North American businesses conducted earlier this year by IHS Infonetics, a technology market research firm based in Englewood, Colo. On average, a business loses almost $4 million per year to downtime -- half a percent of its total revenue.

Companies suffer an average of two network outages per month, with each event lasting approximately six hours, the survey found.

Actually fixing a downtime problem is only 12% of the total cost, according to Matthias Machowinski, research director for enterprise networks and video at Infonetics. Loss of employee productivity and company revenue are the biggest expenses, he says.
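To put those survey averages in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are made up for this example, the inputs are the rounded figures quoted above (the survey said "almost" $4 million and "approximately" six hours), and an 8,760-hour year is assumed.

```python
# Back-of-the-envelope downtime arithmetic using the rounded survey averages.
# All function and variable names are hypothetical, chosen for this sketch.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def annual_downtime_hours(outages_per_month, hours_per_outage):
    """Total hours of downtime over 12 months."""
    return outages_per_month * 12 * hours_per_outage


def availability(downtime_hours):
    """Fraction of the year the network is up."""
    return 1 - downtime_hours / HOURS_PER_YEAR


def implied_revenue(annual_loss, loss_share_of_revenue):
    """Revenue implied by a loss that equals a given share of revenue."""
    return annual_loss / loss_share_of_revenue


annual_loss = 4_000_000                     # "almost $4 million", rounded up
downtime = annual_downtime_hours(2, 6)      # two outages a month, six hours each
avail = availability(downtime)
revenue = implied_revenue(annual_loss, 0.005)  # half a percent of revenue
repair_cost = 0.12 * annual_loss            # fixing the problem: 12% of the cost

print(f"Downtime per year: {downtime} hours")
print(f"Availability: {avail:.2%}")
print(f"Implied annual revenue: ${revenue:,.0f}")
print(f"Share spent on fixes: ${repair_cost:,.0f}")
```

With these inputs, two six-hour outages a month add up to 144 hours of downtime a year, or roughly 98.4% availability, and a $4 million loss at half a percent of revenue implies annual revenue of about $800 million. Fixing the problems would account for about $480,000 of that loss.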

Another cost, more difficult to measure, is the negative public relations hit.

"You only get noticed when you wind up on the front page of the Wall Street Journal for reasons you don't want to be," says Jim Metzler, a networking consultant with Ashton, Metzler & Associates.

That's exactly what happened in July, when United Airlines and the New York Stock Exchange (NYSE) made headlines for all the wrong reasons: Their networks suffered significant outages that paralyzed business operations. United blamed its downtime on a failed router, while the NYSE pointed to problems associated with a software upgrade on the exchange's computers. In September, American Airlines grounded all domestic flights out of its hubs in Chicago, Dallas and Miami for several hours due to what it described as a "network connectivity issue."

"More damaging than the outage is the perception in the market that you have a poorly run organization," says Rick Drescher, managing director of technical services at New York-based real estate services firm Savills Studley.

Often, all it takes is a single misconfiguration or wonky bit of code to wreak havoc on the network. Is technology the solution to that? Or is the ability to prevent and quickly recover from a network outage more about people and processes?

Greatest enemy of uptime: humans?

The blame for network downtime is spread among three culprits, according to Infonetics. The most common is a failure of equipment, software or third-party services. Following that are power outages; human error takes third place. Networks have the highest incidence of downtime caused by service providers, whereas applications have the highest rate of human error, Machowinski says.

But other recent studies have blamed human error for the vast majority of network outages. Avaya found that 82% of the companies it surveyed last year experienced some type of network outage caused by IT personnel making mistakes when configuring changes to the core of the network. Dimension Data, in its 2014 Network Barometer report, said that humans get the biggest blame for IT service incidents.

"Most people who work in IT shops are trained in technology," says Metzler, who has worked as a systems developer at AT&T and was once responsible for transmission, switching and routing for the internal network at Digital Equipment Corp. "They weren't trained in processes. And their processes are weak."

It's human nature to shortcut a process, he says. On top of that, networking technology lends itself to making mistakes because it tends to have an arcane interface.

"That's kind of the perfect storm," Metzler says.

He believes software-defined networking (SDN) has the potential to improve processes, centralize network configurations and reduce the amount of manual effort required to manage networks. But he remains cautious.

"There is a real irony here," Metzler says. "SDN holds the promise of simplifying things dramatically, [but] the path from here to there is a very complex path. That's why we've been talking about SDN for three to four years and haven't really implemented it in the data center."

Not everyone completely agrees.

"SDN will not end all network outages," wrote Andrew Lerner, a research director at Gartner, in a blog post shortly after the outages at United and the NYSE.

"As you move toward SDN, you're ultimately running two architectures in the environment (since there's very little greenfield)," he wrote. "Thus, during the transition, you actually have more stuff to contend with, which can make networks more complex."

That may be true, but it's also true of every major technology migration, says Savills Studley's Drescher. Running parallel infrastructure, he contends, has allowed advances in production computing environments. 

"If what SDN can provide you in the long term -- automation, easier disaster response, removed or reduced reliance on specific hardware -- outweighs the risks of the process of getting you there, then you should leverage a combination of the technology, people and process you do have to move things forward," Drescher says.

Figure: Most common causes of network downtime

Start with a plan

The first step in developing a disaster recovery or business continuity plan: Figure out the organization's expectations for an acceptable amount of risk and downtime. Not all systems are equally important, so divide them into different tiers, depending on how long they can be down.

"A lot of organizations skip that step and automatically try to solve everything in one clip. That's the most expensive way to do it," Drescher says. "Everything has to be up all the time, and that's incredibly expensive if you're a large organization."
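The tiering step can be sketched in Python. This is a minimal illustration, not a method described by Drescher: the tier cutoffs, system names and tolerable-downtime values below are all hypothetical, and a real plan would take them from the business.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the longest outage (in hours) each tier can tolerate.
TIER_LIMITS = [(1, "tier 1"), (8, "tier 2"), (72, "tier 3")]


@dataclass
class System:
    name: str
    max_downtime_hours: float  # how long the business says it can be down


def assign_tier(system):
    """Return the first tier whose cutoff covers the system's tolerance."""
    for limit, tier in TIER_LIMITS:
        if system.max_downtime_hours <= limit:
            return tier
    return "tier 4"  # everything that can wait longer than the last cutoff


def group_by_tier(systems):
    """Map each tier name to the list of system names assigned to it."""
    tiers = {}
    for system in systems:
        tiers.setdefault(assign_tier(system), []).append(system.name)
    return tiers


# Example inventory (made-up systems and tolerances).
inventory = [
    System("order entry", 0.5),
    System("email", 4),
    System("file shares", 24),
    System("test lab", 200),
]

tiers = group_by_tier(inventory)
for tier in sorted(tiers):
    print(tier, tiers[tier])
```

Only the systems in the top tier would need the most expensive always-on protection; the rest can be recovered on a slower, cheaper schedule.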

Another word of advice: Don't just write a business continuity plan. Test it -- frequently. Wait a couple of years and you risk finding that many of the systems are not in your IT environment anymore. Employees may be gone, too.

In his role at Savills Studley, Drescher also works in the company's critical facilities group, which helps commercial clients locate, negotiate and lease data center space from colocation providers. For those enterprise clients that already use VMware, Drescher steers them toward the vendor's Site Recovery Manager tool to assist with automating and continually testing their disaster recovery procedures.

"It handles your disaster recovery for you," he says. "It can be constantly watching data replication to make sure you're not falling out of your SLAs." 

Fulfilling service-level agreements (SLAs) was the problem facing Sebastian Pereira, director of IT at software engineering company Santex.

Pereira and his IT team, based in Cordoba, Argentina, realized they also needed an asset management system to track the laptops, phones and other devices his developers use. He implemented Samanage, a cloud-based IT service desk and asset management platform. By tracking his assets in more detail, Pereira found he was able to ensure uptime and monitor performance with greater ease.

"Sometimes I don't have full service expectations [for] my users. But [now] I can improve their expectation of service," Pereira says. "I could use the SLA to increase the perception of quality of service from my users."

The new Samanage tool has helped prioritize requests from customers and given Pereira more detailed information on his team's service performance beyond simply the time it takes to respond to an incident.

"We started looking at other things besides SLAs, like how many tickets we resolved the first time," he says.

For Santex, humans still play an important role when it comes to handling network outages.

"We don't have everything automated. And even if we did, people need to be involved; a machine cannot manage expectations," Pereira says. "That's one of the most important things when you're in a crisis: the way a person can manage expectations -- updating customers quickly, with good updates, and relieving anxiety."

But for Kelso & Company, a private equity firm in New York City, technology is certainly the answer to network downtime. Prior to the Dodd-Frank Act, a financial services regulation enacted in 2010, the firm was not required to archive email or present data recovery plans, according to Christopher Daniels, an IT consultant who manages the Kelso network.

Kelso invested in server-based Recover2Cloud from Sungard Availability Services in 2013 to back up the accounting server, email server, a domain controller and a file server.

And while it's difficult to predict when or how a network outage might strike, the ability to get services back online again quickly can be invaluable.

"When we ran our last test, I was expecting to wait several hours before we were back online. It was 50 minutes before we were up online in the test environment," Daniels says. When he brought up the servers, the recovery lag was just 90 seconds; previously, the servers had been four or five hours behind.
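A test result like that is usually judged against two targets that disaster recovery planners call the recovery time objective (RTO, how long restoration may take) and the recovery point objective (RPO, how far behind the recovered data may be). The Python sketch below shows the comparison. The 50-minute and 90-second figures come from Daniels' test, but the targets are hypothetical placeholders, since Kelso's actual objectives are not given, and the function name is invented for this example.

```python
from datetime import timedelta


def meets_objectives(recovery_time, data_lag, rto, rpo):
    """Compare a recovery test result against RTO and RPO targets."""
    return {
        "rto_met": recovery_time <= rto,  # back online fast enough?
        "rpo_met": data_lag <= rpo,       # recovered data recent enough?
    }


# Hypothetical targets, chosen only for illustration.
RTO = timedelta(hours=4)
RPO = timedelta(minutes=15)

# Figures reported from the most recent test.
latest = meets_objectives(
    recovery_time=timedelta(minutes=50),
    data_lag=timedelta(seconds=90),
    rto=RTO,
    rpo=RPO,
)

# Same recovery time, but with servers four hours behind, as they were before.
earlier = meets_objectives(
    recovery_time=timedelta(minutes=50),
    data_lag=timedelta(hours=4),
    rto=RTO,
    rpo=RPO,
)

print("latest test:", latest)
print("with four-hour lag:", earlier)
```

Under these assumed targets, the latest test passes on both counts, while a four-hour replication lag would miss the recovery point target.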

"If the SEC comes in here, we want to say we have the best in place," he says.

This was last published in November 2015
