Determining the impact of wide area network outages

Wide area network (WAN) outages are a fact of life, and the resulting impact on remote users can be significant. Understanding that impact not only helps network engineers plan an appropriate response but can also justify the expense of that response.

Unfortunately, almost all IT engineers will face network outages at their companies. Understanding the impact of an enterprise's WAN downtime, in terms of both lost productivity and lost sales, is a necessary step in developing countermeasures that minimize the effect on remote users. With a few pieces of data, a WAN engineer can get a picture of what it means to the organization when the WAN links fail and remote sites become separated from the rest of the network.

WAN networking vendors, particularly those that offer WAN redundancy or link binding solutions, are quick to cite some staggering numbers from industry analyst reports. One report from Gartner, for example, claims that the average large enterprise racks up nearly 90 hours of downtime a year and estimates losses of nearly $4 million because of it. Another study from Infonetics goes further, claiming an average annual downtime over five times higher. Unfortunately, the research reports that featured these numbers have become a bit dated: they rely on data collected more than five years ago and have not been updated to reflect the latest technology trends. A push toward managed services for enterprise WAN connectivity is no doubt helping to reduce downtime events. At the same time, however, the higher bandwidth and simplified network designs of Metro Ethernet and Wide Area Ethernet services are encouraging enterprises to push applications and services out of remote sites and back into centralized data centers. With fewer services physically located at branch offices, the reliance of these remote sites on WAN services has been steadily increasing.

Quantifying the real impact of WAN downtime

Determining the costs of a WAN downtime event poses an interesting challenge for most organizations. Essentially, one has to understand both the hard costs, such as average salaries and benefits for the employees who have been suddenly idled, and the soft costs of an outage, such as the loss of potential sales.

For obvious reasons, the hard costs are far easier to obtain. The costs of keeping the lights on at a remote site and of keeping it staffed with employees are relatively simple to calculate and break down into an hourly expense per employee. Multiply that hourly expense by the number of employees affected and by the number of hours the WAN outage lasts, and you have a good picture of the hard expense side of WAN downtime. One element that is often left out of this side of the equation, however, is the cost of returning to normal after an outage. For example, if branch employees switched to manual processes during the downtime, additional time will be needed to catch the IT systems up, whether that means transferring manual invoices into the billing system or updating inventory levels with the latest production. This return-to-normal period needs to be factored into the equation because it adds heavily to the impact of even a short bout of downtime.
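The hard-cost arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name, its parameters and the sample figures are hypothetical, and it assumes that return-to-normal hours are charged at the same hourly rate as the outage itself.

```python
def hard_cost(employees_affected, hourly_cost_per_employee,
              outage_hours, recovery_hours=0.0):
    """Rough hard cost of a WAN outage at one remote site.

    Assumption (not from any analyst report): catch-up time after the
    outage costs the same per hour as the idle time during it.
    """
    values = (employees_affected, hourly_cost_per_employee,
              outage_hours, recovery_hours)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    # Idle time plus the return-to-normal period, both billed hourly.
    total_hours = outage_hours + recovery_hours
    return employees_affected * hourly_cost_per_employee * total_hours


# Hypothetical branch: 40 staff at $55 per hour, a 3-hour outage
# followed by 1.5 hours of re-entering manual paperwork.
estimate = hard_cost(40, 55.0, 3.0, 1.5)
print(f"Estimated hard cost: ${estimate:,.2f}")
```

Keeping the recovery period as a separate input makes it harder to forget, which is the main reason to break it out rather than fold it into the outage duration.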

The soft numbers that factor into the network-outage-impact equation are much more difficult to specify. For example, it is nearly impossible to determine the potential loss of sales at a retail location because a customer got frustrated and walked out instead of waiting for the systems to come back. This half of the equation can also vary significantly among industries as well as among the remote sites within an enterprise. A retail location being down, for instance, certainly has a greater impact on the bottom line than a remote office filled with corporate staff. Because of the many variables and possibly sheer guesses that would have to be made on these soft numbers, many experts suggest simplifying the calculation: break down the organization's profit to a per-employee, per-hour figure and use that to estimate the sales impact of an outage. While this approach makes getting to a reasonable number easier, the result should always be framed in the context of both the organization and the remote sites involved and adjusted accordingly.
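As a rough sketch of that simplification, the snippet below breaks annual profit down to a per-employee, per-hour rate and applies it to an outage. The function names and sample figures are hypothetical. The 2,080 working hours per year (40 hours over 52 weeks) and the site_weight multiplier, which stands in for the site-by-site adjustment mentioned above, are assumptions rather than anything the analysts prescribe.

```python
def profit_per_employee_hour(annual_profit, total_employees,
                             work_hours_per_year=2080):
    """Annual profit spread evenly over every employee working hour."""
    if total_employees <= 0 or work_hours_per_year <= 0:
        raise ValueError("employees and hours must be positive")
    return annual_profit / (total_employees * work_hours_per_year)


def soft_cost(annual_profit, total_employees, employees_affected,
              outage_hours, site_weight=1.0, work_hours_per_year=2080):
    """Rough soft cost of an outage at one site.

    site_weight is a hypothetical judgment factor: above 1.0 for a
    revenue-generating site such as a store, below 1.0 for a back office.
    """
    rate = profit_per_employee_hour(annual_profit, total_employees,
                                    work_hours_per_year)
    return rate * employees_affected * outage_hours * site_weight


# Hypothetical company: $10.4M annual profit, 500 employees.
# A 3-hour outage at a 40-person retail site, weighted at 2.0.
estimate = soft_cost(10_400_000, 500, 40, 3.0, site_weight=2.0)
print(f"Estimated soft cost: ${estimate:,.2f}")
```

The weight is a guess by design. Writing it down as an explicit parameter at least makes the guess visible and easy to revisit for each site.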

Who is most at risk for WAN downtime?

While the effect of WAN downtime is technically the same for enterprises of all sizes, the ability to respond to and resolve a WAN outage is really what separates large organizations from small and midsized ones. Smaller companies usually lack the luxury enjoyed by larger enterprises of dedicated network staff who can focus exclusively on the wide area network. Similarly, in the event of a large regional or nationwide outage, service-level agreements, as well as the sheer size of their WAN contracts, will probably put larger enterprises higher on the priority list to have their service restored, with the smaller organizations waiting their turn. A general understanding of where your organization might fall in the priority scale will help guide the approach taken to minimize WAN downtime.

Mitigating WAN downtime

Ultimately, the approach to minimizing WAN downtime has to be a proportionate response to the impact an outage has on the organization. Building a fully redundant WAN might be a necessity for a large financial services organization, for example, but would be too costly for a small operation. Smaller organizations in particular have to balance the cost of providing an alternative access path against the potential impact of an outage of the primary links.
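One simple way to frame that balance is to compare the yearly cost of an alternative path with the outage cost it is expected to avoid. The sketch below is illustrative only. Every name and figure in it is hypothetical, and it assumes that outage cost scales linearly with hours and that the backup path avoids a fixed fraction of them.

```python
def backup_net_benefit(expected_outage_hours_per_year, cost_per_outage_hour,
                       annual_backup_cost, fraction_avoided=1.0):
    """Expected yearly saving from a backup path minus its yearly cost.

    A positive result suggests the backup pays for itself; a negative
    one suggests the response is out of proportion to the impact.
    """
    if not 0.0 <= fraction_avoided <= 1.0:
        raise ValueError("fraction_avoided must be between 0 and 1")
    avoided_cost = (expected_outage_hours_per_year
                    * cost_per_outage_hour
                    * fraction_avoided)
    return avoided_cost - annual_backup_cost


# Hypothetical site: 8 outage hours a year at $3,000 per hour, and a
# $12,000-a-year secondary link expected to cover 90% of those hours.
net = backup_net_benefit(8, 3000.0, 12_000.0, fraction_avoided=0.9)
print(f"Net annual benefit of the backup link: ${net:,.2f}")
```

The per-hour figure here would come from adding the hard and soft cost estimates for the site in question.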

The threat of a network outage should also be considered when deploying new applications and services across the wide area network. For example, while it might be more cost effective to deploy a VoIP solution that feeds back to the corporate data center, provision should still be made to deliver local dial tone in the event of a WAN failure. Likewise, if the impact-analysis work determines that there are applications that simply must not go down, enabling secondary WAN links as a fallback or deploying the application in a more distributed manner might be required. And if a WAN outage threatens to go beyond a short-term problem, network engineers should be prepared to enact their WAN disaster recovery plans to get their remote sites back up and running. Based on the impact calculations, a downtime event lasting more than a couple of hours could certainly be defined as a disaster. Network engineers should bring to bear every option available to them to keep their remote users connected, as long as the impact of WAN downtime warrants it.

This was last published in July 2010
