High availability (HA) virtualization applications, the tools that promise to automatically restart failed VMs after either a VM crash or a host failure, give developers and server administrators false hopes.
Server teams begin to believe they can apply HA tools to any hodgepodge of enterprise spaghetti code, and the recent inter-DC VM mobility challenges are a direct consequence of that false belief in HA magic.
Let’s consider the myths and realities of HA products:
VMware High Availability: Even if you can reliably detect a failure of the VM OS or of an application service (for example, database software), the VM still needs to be restarted, and the restart takes time. A few minutes gone is a nine lost.
VMware Fault Tolerance: This feature runs two concurrent copies of the same VM on two hosts. It’s a perfect solution for a short-term problem, i.e., I don’t want my long batch processing to be interrupted by a hardware failure. The problem? If the VM or its software crashes, both copies of the VM will crash concurrently. Great.
High-availability clusters: While strategies like Windows Server Failover Clustering restart a failed service (for example, the SQL server) on the same or another server, the restart can take from a few seconds to a few minutes—or sometimes even longer if the database has to do extensive recovery. A nine lost.
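To put rough numbers on "a nine lost," the following minimal sketch converts a number of nines into a yearly downtime budget. It assumes a 365-day year and counts every minute of downtime against availability; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(nines: float) -> float:
    """Return the allowed downtime per year for the given number of nines.

    Three nines means 99.9% availability, i.e. an unavailability of 10**-3.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    budget = downtime_budget_minutes(n)
    print(f"{n} nines: about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 53 minutes per year and five nines only about five minutes, so a single restart that takes a few minutes can consume the entire five-nines budget.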
Now let me give you another data point: We recently experienced a forwarding loop caused by an intra-site STP failure. Recovery time: close to 30 minutes, even though the NMS noticed the problem immediately and an operator was available on-site. Admittedly, some of that time was spent collecting evidence for post-mortem analysis.
Next fact: Bridging between data centers might cause long-distance forwarding loops, or the flood of traffic caused by a forwarding loop in one data center might spill over the WAN link into the other, killing all other inter-DC traffic, including storage replication and (if you’re brave enough to use long-distance clusters) cluster heartbeats. Are you really willing to risk your whole IT infrastructure to support an application that cannot achieve more than 3.5 nines anyway? After all, one would hope your server admins do patch the servers, and patches do require an occasional restart, don’t they?
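The same arithmetic can be run in reverse to see how many nines survive a given amount of yearly downtime. This is a sketch under stated assumptions: the 30-minute figure is the forwarding-loop recovery time mentioned above, while the patching schedule (12 restarts per year at 14 minutes each) is a hypothetical example, not a measured value.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a 365-day year


def achieved_nines(downtime_minutes_per_year: float) -> float:
    """Return availability as a number of nines for a yearly downtime total."""
    if downtime_minutes_per_year <= 0:
        raise ValueError("downtime must be positive")
    return -math.log10(downtime_minutes_per_year / MINUTES_PER_YEAR)


# A single 30-minute outage in a year (the forwarding-loop recovery time).
loop_outage = achieved_nines(30)

# Hypothetical patching schedule: 12 restarts per year, 14 minutes each.
patching = achieved_nines(12 * 14)

print(f"One 30-minute outage: {loop_outage:.2f} nines")
print(f"Monthly patch restarts: {patching:.2f} nines")
```

With these inputs, one 30-minute outage alone caps the year at roughly 4.2 nines, and the assumed patching schedule lands at about 3.5 nines before any unplanned failure is counted.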
Moral of the story: “Magic” products give you a false sense of security; good application architecture and the use of truly highly available products, like a MySQL database cluster, are the only true solution to the high-availability challenge.
About the author: Ivan Pepelnjak, CCIE No. 1354, is a 25-year veteran of the networking industry. He has more than 10 years of experience in designing, installing, troubleshooting and operating large service provider and enterprise WAN and LAN networks and is currently chief technology advisor at NIL Data Communications, focusing on advanced IP-based networks and Web technologies. His books include MPLS and VPN Architectures and EIGRP Network Design. Check out his IOS Hints blog.