
Amazon cloud failure highlights reliability, DR concerns for providers

Cloud providers can learn from the recent, high-profile Amazon cloud failure by revisiting their disaster recovery (DR) strategies.

Online businesses like Instagram, Pinterest and Netflix recently learned the hard way, during the second Amazon cloud failure in a month, that the cloud is not invincible. Although other cloud providers may be tempted to enjoy a little schadenfreude, the latest outages underscore some of the disaster recovery (DR) challenges and opportunities all providers face.

Two of the data centers in Amazon Web Services' (AWS) US East-1 region -- located in Ashburn, Va. -- suffered significant outages June 29 after a large electrical storm tore through Northern Virginia and Washington D.C., knocking out power throughout the region. The two data centers were supporting one of Amazon's "availability zones" in that region. Amazon defines these availability zones as "distinct physical locations [that] are engineered to isolate failure from each other," but offers no further clarification for how the zones are set up.

For all its redundancy, the cloud is still based on physical equipment -- and physical equipment fails. And even if cloud providers can't control the weather, they must still have a Plan B in case of a cloud outage.

While it's common for large, pure-play cloud providers to divide resources among different availability zones and separate physical data centers, many customers -- especially those that rely on the cloud to provide Web-based services to their end users -- are beginning to have reservations about reliability.

As customers begin to consider a multi-vendor cloud environment for their DR needs, cloud providers must figure out how to support them. Highly publicized outages like Amazon's cloud failure may provoke customer anxiety across the market, but they also give cloud providers the opportunity to learn from those mistakes and prepare a better DR plan.

Latest Amazon cloud failure: What happened?

Large-scale cloud outages continually hit the news, and this is not the first time an electrical storm and power outage have crippled part of Amazon's cloud. Its Virginia data centers were battered by electrical storms in 2009, suffering power outages and service disruptions in June and December of that year. Lightning was also blamed in August 2011 for taking down Elastic Compute Cloud (EC2) instances in Dublin.

The latest Amazon cloud failure also follows a comparatively minor outage it suffered in the same region June 14, which was triggered by power failures as well.

During the most recent AWS outage, one of the two data centers affected by the storm did not successfully fail over to generator power after the data center's electrical switching equipment was overcome by a large voltage spike, according to a recent statement from Amazon about the incident. The subsequent combination of EC2 instances going offline and the discovery of a bug in the Elastic Load Balancer control plane triggered an Amazon cloud failure lasting six hours for customers in that availability zone.

Amazon competitor Joyent later tweeted it had not experienced any service disruptions that night, despite being located in the same Virginia data center where Amazon suffered the outage.

Can cloud balancing save the day during future cloud outages?

The ability to load balance across a heterogeneous cloud environment, also known as cloud balancing, can help providers ride out a cloud outage or any unexpected spike in traffic, enabling them to promote redundancy and even cost savings for the customer, said Apurva Dave, vice president of product marketing at San Francisco, Calif.-based Riverbed Technology, a wide area network (WAN) optimization vendor.

"[Cloud] failure isn't always just an outage, like AWS experienced -- it can come in shades of gray, like when webpages are loading too slowly," he said. "For applications that are mission-critical and cannot suffer downtime in the cloud, cloud balancing is a good insurance policy for customers."
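As a rough illustration of the idea, the sketch below picks a target for new traffic from a pool of backends spread across providers. It treats both hard outages and slow responses (the "shades of gray" Dave describes) as reasons to steer traffic elsewhere. The `Backend` type, the provider labels and the 500 ms ceiling are assumptions made for this example, not details of any vendor's product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str          # hypothetical provider/region label
    healthy: bool      # result of the most recent health check
    latency_ms: float  # recently observed response time


def pick_backend(
    backends: Sequence[Backend], max_latency_ms: float = 500.0
) -> Optional[Backend]:
    """Return the fastest healthy backend, preferring ones under the latency ceiling."""
    up = [b for b in backends if b.healthy]
    if not up:
        return None  # every provider is down; the caller must handle this
    # Prefer backends that are responsive enough; fall back to any healthy one.
    fast = [b for b in up if b.latency_ms <= max_latency_ms]
    pool = fast or up
    return min(pool, key=lambda b: b.latency_ms)


pool = [
    Backend("provider-a-east", healthy=False, latency_ms=40.0),  # hard outage
    Backend("provider-b-west", healthy=True, latency_ms=120.0),
    Backend("provider-c-east", healthy=True, latency_ms=900.0),  # up but slow
]
choice = pick_backend(pool)
print(choice.name if choice else "no backend available")
```

A production balancer would also weigh cost, capacity and session affinity, which is where the premium Barnett mentions comes in.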

But everything comes at a cost, noted Sam Barnett, directing analyst of data center and cloud for Campbell, Calif.-based Infonetics Research Inc., and not all customers may be willing to pay a premium for cloud balancing.

"Having a way to interconnect or workload mobility does cost extra, and it's really up to the customer to determine those costs and benefits," he said.

Amazon cloud failure: Are customers flocking to multi-vendor cloud strategies?

Because of the inherent vulnerabilities in the physical components of the data center, cloud failures are unavoidable, Barnett said. As customers become more aware of this, cloud providers should not be surprised when more of the customers that rely heavily on the cloud to deliver Internet-based services no longer want to be tied strictly to a single facility.

Nirvanix Inc., a cloud storage provider, is seeing momentum for this trend. Nirvanix hosts petabytes of customer data in its own cloud, and many of its customers also maintain a copy of that data in another provider's cloud, said Steve Zivanic, vice president of marketing for San Diego, Calif.-based Nirvanix.

It's not that customers want two copies in different locations in a single provider's cloud, he said. Rather, they're replicating the data across several geographic locations and multiple cloud providers to ensure the data stays continuously accessible.
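The pattern Zivanic describes can be sketched in a few lines: write each object to every provider, then read from whichever copy is still reachable. The `ProviderStore` class below is a hypothetical in-memory stand-in, not Nirvanix's or any other provider's API, and it ignores consistency, cost and transfer time.

```python
from typing import Dict, List, Optional, Sequence


class ProviderStore:
    """Hypothetical in-memory stand-in for one provider's object store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._objects[key]


def replicate(stores: Sequence[ProviderStore], key: str, data: bytes) -> List[str]:
    """Write the object to every reachable provider; return the names that took it."""
    written = []
    for store in stores:
        try:
            store.put(key, data)
            written.append(store.name)
        except ConnectionError:
            continue  # skip a provider that is down and keep going
    return written


def read_any(stores: Sequence[ProviderStore], key: str) -> Optional[bytes]:
    """Return the first copy that can be read, or None if no provider has it."""
    for store in stores:
        try:
            return store.get(key)
        except (ConnectionError, KeyError):
            continue
    return None


primary = ProviderStore("provider-a")
secondary = ProviderStore("provider-b")
copies = replicate([primary, secondary], "report.pdf", b"contents")
primary.available = False  # simulate an outage at the first provider
recovered = read_any([primary, secondary], "report.pdf")
```

With one provider offline, the read falls through to the second copy, which is the continuity customers are paying for when they keep data in more than one cloud.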

Then again, not all cloud providers should expect a surge of DR business. Customers that are happy with their cloud provider tend not to move or consider another provider's offering, Barnett said, noting that cost savings may also drive customers to stay with one provider.

"It really depends on the business need, and the application supported in the cloud," he said.

Cloud providers should strive to site their data centers in areas that make sense geographically, and learn from the mistakes of cloud providers like AWS, Barnett said. Balancing redundancy and reliability for customers, while keeping costs low, will be mission-critical for cloud providers that want to stay competitive.

Let us know what you think about the story; email: Gina Narcisi, News Writer.
