- Steve Zurier, ZFeatures
Network outages can cost organizations millions of dollars and badly damage their reputations. Just ask Southwest Airlines and Delta Air Lines, carriers that suffered major network outages last summer. Southwest's outage cost an estimated $54 million and the Delta power outage reportedly cost $150 million.
Industry experts say the airlines have been struggling to deliver more advanced technology services to their customers, putting them at risk for network issues. Travelers of all stripes are ordering flights online and want to receive tickets on their cellphones, putting a strain on booking and ticketing systems and corporate networks.
While the airlines' troubles have been fairly well-documented, outages are not unique to them. Most other industries are experiencing strains on their networks as well, and outages can and do happen in many other sectors -- hitting major banks, telecommunication providers, cloud providers and universities. A one-day outage at Salesforce once cost the company $20 million.
Networking analysts say organizations can reduce the pain from network outages by following some standard best practices.
"In the case of Southwest Airlines, where a router went down, that really shouldn't have happened," said Dan Conde, an analyst who covers networking technologies for the Enterprise Strategy Group.
Conde said companies need to think in terms of three-to-five-year refresh schedules for core infrastructure and focus on built-in redundancy. They should also take advantage of modern network management tools that offer visibility into the network.
Roberto Dovalina, associate director of digital infrastructure at St. Edward's University, based in Austin, Texas, said that's precisely what his team does. He and his colleagues support roughly 5,500 students and 1,200 faculty and staff.
Dovalina said St. Edward's has deployed redundant core routers, firewalls and server chassis in the data center, equipment that his team refreshes every three to five years. The team also replaces the 12 routers that support the campus buildings every five to seven years.
By having the redundant infrastructure, St. Edward's can periodically take down each piece of equipment for half a day to run tests.
They've also built intelligent logic and scripts into the system, so in the event the core routers go down, they can bring one or both of the routers back up and have all the applications ready to go with minimal downtime.
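The article does not describe how St. Edward's scripts work, so the following is only a minimal sketch of one common approach: working out a safe order in which to bring applications back after the core routers recover. The service names and their dependencies are hypothetical, and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical services, each mapped to the services it depends on.
# None of these names come from St. Edward's actual environment.
DEPENDENCIES = {
    "core-network": set(),
    "dns": {"core-network"},
    "database": {"core-network"},
    "auth": {"dns", "database"},
    "student-portal": {"auth", "database"},
}


def restart_order(dependencies):
    """Return a startup order in which every service follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, service in enumerate(restart_order(DEPENDENCIES), start=1):
        print(f"{step}. start {service}")
```

A real recovery script would also need health checks between steps; the sketch only shows the ordering logic.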
Best practices to prevent a network outage
Here are nine tips from ESG's Dan Conde on how to prevent a network outage at your organization:
1. Follow these best practices end-to-end. Your network is only as strong as its weakest link.
2. Do the basics. Maintain hardware, avoid old systems, run diagnostics, have proper power supply with backup and run power system stress tests.
3. Run drills for the whole system. Run shutdowns on some links and see if the proper failovers occur. If they don't, you might have a configuration problem.
4. Use router standby protocols, if available. Be sure to have redundant links between router layers -- and use protocols such as Virtual Router Redundancy Protocol or Hot Standby Router Protocol so standby routers can take over if a primary one fails.
5. Partner with your ISP. Have alternate paths from the network carrier. Also, pay for enough bandwidth so that if standby paths get saturated, they don't cause a cascading failure.
6. Use updated network management tools. Have proper network visibility and monitoring tools that are used all the time, during drills and during app deployment tests. This part is key to service assurance -- if you don't see a problem, you don't know how to deal with it.
7. Think about the application layer. Design the whole architecture so that the infrastructure works with the apps. Don't force-fit the app to use the infrastructure you have -- architect the apps first and then design the infrastructure to meet their needs. Look at both elements together.
8. Be thorough. Look for link failures and device failures. Don't focus on one at the expense of the other.
9. Follow up. If you do have a failure, in real life or in a drill, do a thorough post-mortem analysis.
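Tips 3 and 8 can be rehearsed on paper before any equipment is touched. The sketch below is not Conde's method; it is a simple illustration that takes a hypothetical topology, removes each link and each device in turn, and reports which campus buildings would lose every path to an ISP uplink. The device names and the `failure_drill` helper are assumptions made for the example.

```python
from collections import deque

# Hypothetical topology: each tuple is a link between two devices.
LINKS = [
    ("isp-a", "core-1"), ("isp-b", "core-2"),
    ("core-1", "core-2"),
    ("core-1", "bldg-1"), ("core-2", "bldg-1"),
    ("core-1", "bldg-2"),  # bldg-2 has only one uplink
]


def reachable(links, start, down_device=None):
    """Return the devices reachable from start, ignoring links to down_device."""
    neighbors = {}
    for a, b in links:
        if down_device in (a, b):
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated_targets(links, sources, targets, down_device=None):
    """Return targets that cannot reach any surviving source."""
    covered = set()
    for source in sources:
        if source != down_device:
            covered |= reachable(links, source, down_device)
    return sorted(t for t in targets if t != down_device and t not in covered)


def failure_drill(links, sources, targets):
    """Simulate every single link failure and every single device failure."""
    findings = []
    for link in links:
        remaining = [other for other in links if other != link]
        lost = isolated_targets(remaining, sources, targets)
        if lost:
            findings.append(("link", link, lost))
    for device in sorted({d for link in links for d in link}):
        lost = isolated_targets(links, sources, targets, down_device=device)
        if lost:
            findings.append(("device", device, lost))
    return findings


if __name__ == "__main__":
    for kind, item, lost in failure_drill(LINKS, ["isp-a", "isp-b"], ["bldg-1", "bldg-2"]):
        print(f"{kind} failure {item} isolates {lost}")
```

In this made-up topology the drill flags both the single uplink to bldg-2 and the core-1 router itself, which is the point of tip 8: checking links alone would miss the device failure.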
"Fixing the network after an outage is fairly straightforward and may not take much time," Dovalina explained. "It's bringing back the applications after an outage that is the most time-consuming. So by using the intelligent logic and scripts to bring back the application automatically, we can bring back the system fairly quickly. Users will barely notice an outage in the event some of the equipment goes down."
'Don't break anything'
According to Dimension Data's 2016 Network Barometer report, 37% of network service incidents are due to human error, many of those related to configuration mistakes.
Organizations are taking steps to correct the configuration issue. At St. Edward's University, Dovalina said an engineer needs to approve any configuration change to the network.
Fidelity Information Services takes this concept one step further. Robert Lumsden, enterprise network engineer, said every change ticket requires a full peer review. And prior to the change, the engineer, internal customers and any other pertinent stakeholders -- such as staff from the accounting or sales department -- hold a meeting so the engineer can fully explain the change and respond to any questions.
"What we try to do is evaluate the risk if something goes wrong," Lumsden said. "Our motto is 'don't break anything.'"
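Neither organization's change tooling is described in the article, so the following is only a hedged sketch of how a peer-review gate might be modeled. The `ChangeTicket` class, its fields and the sample configuration lines are all hypothetical; the diff is produced with Python's standard `difflib` module so reviewers can see exactly what would change.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class ChangeTicket:
    """Hypothetical change ticket that blocks unreviewed config changes."""
    author: str
    running_config: str
    proposed_config: str
    approvals: set = field(default_factory=set)

    def diff(self):
        """Return a unified diff of the running vs. proposed config."""
        return "\n".join(difflib.unified_diff(
            self.running_config.splitlines(),
            self.proposed_config.splitlines(),
            fromfile="running", tofile="proposed", lineterm="",
        ))

    def approve(self, reviewer):
        """Record an approval; authors may not approve their own change."""
        if reviewer == self.author:
            raise ValueError("authors cannot approve their own change")
        self.approvals.add(reviewer)

    def ready_to_apply(self):
        """A change is ready only after at least one peer has approved it."""
        return bool(self.approvals)


if __name__ == "__main__":
    ticket = ChangeTicket(
        author="alice",
        running_config="hostname core-1\nmtu 1500",
        proposed_config="hostname core-1\nmtu 9000",
    )
    print(ticket.diff())
    ticket.approve("bob")
    print("ready to apply:", ticket.ready_to_apply())
```

The stakeholder meeting Lumsden describes is a human step that code cannot replace; the sketch covers only the approval record and the reviewable diff.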
Closer partnerships, lifelong learning
Organizations also need to form better partnerships with their vendors, ultimately minimizing the risk of network outages, St. Edward's Dovalina said. He added the university has worked closely with Extreme Networks to deploy its switches and routers.
"When selecting a vendor, you have to ask yourself, 'Do they provide a full solution or just network equipment?'" Dovalina said. "When we start a project with Extreme Networks, we work together to strategize and define the solution, then set up a proof of concept that lets us test it as long as we need to before it's deployed in production."
It's also important to keep learning about the latest networking trends. Dovalina said he and Paul Miklas, senior network administrator at St. Edward's, make sure to attend local and national trade shows regularly.
"People wait for the technology to come to them," Miklas said. "We try to be proactive. For example, we're spending a lot of time now learning about emerging technologies and how they can fit into our operation."
The rise of programmable networks
Some advocates of programmable networking technology say it can minimize the risk of network outages by reducing the burdens of manual configuration and the associated potential for errors.
"For the past 20 years, managing networks has been more or less the same," said Jeff Reed, senior vice president of enterprise networks at Cisco. "Customers tell us that their network engineers spend 80% of their time just keeping the lights on. Many of the processes are manual-based tasks that keep top technology people from focusing on the applications that make the business tick."
Reed said Cisco has been focusing on more efficient design in its switches, which also helps network engineers reduce configuration time. Vendors such as Brocade Communications Systems Inc., Pluribus Networks and Barefoot Networks also offer programmable networking technology.
"We're trying to build more intelligence into the switches, so network engineers only have to deal with high-level policy," Reed explained. "Network engineers should be focused more on which application traffic they need to prioritize as opposed to the nitty-gritty of network design."
And sure, while better, more efficient switches and routers can minimize the risk of network outages, networking organizations will still need to heed the best practices outlined by Dovalina and Lumsden. The new switches may last a bit longer and reduce overhead, but networking organizations will still need refresh policies. And even if the vast majority of configurations are automated, they will still need to be monitored in the event something irregular happens.
Finally, networking organizations will need to review the landscape and pick the networking partner that can best take them into the future. Some organizations will stick with in-house data centers and may remain loyal to familiar technologies and policies. But economics may drive many enterprises to operate at least a part of their data center with cloud providers such as Amazon Web Services and Microsoft Azure, a path that will require new ways of thinking and new technologies such as open source networking.
But that's a topic for another day. For now, all you may want to do is keep the lights on.