Any enterprise running a high-availability application must answer the following fundamental question: How do we create a resilient application architecture when the underlying communications infrastructure may be unreliable?
Here's some background. My consulting firm was working with a customer whose primary business application had high-availability requirements. The company's clients sent transactions to the primary data center application server and buffered the transaction until a confirmation was received. The customer had two data centers configured as primary and backup.
The customer also experienced network-related outages several times a year, and the failover mechanism to switch from the primary to the backup data center was a manual process that took hours to execute. As a result, the network problems were fixed before the failover process had been completed. Clearly, the customer needed a more reliable data center failover mechanism that would enable its clients to keep accessing the high-availability application.
One option was to make the network and data centers highly reliable, so an outage of any data center would be extremely rare. The architecture of highly reliable infrastructure tends to be fragile, however, and small changes can result in outages that are difficult to diagnose and correct.
Resilient application architecture
To avoid making the systems fragile, a better approach to resilient applications would be to deploy an active-active data center architecture that doesn't rely on a single path or function. The term active-active refers to the use of at least two data centers where both can service an application at any time, so each functions as an active application site. The clients can perform their transactions at any active data center, and the design and operation of each data center can be much simpler than trying to create a single, super-reliable data center.
Note that the resilience should be built into the application and not the network and IT infrastructure. This means the application continues to be accessible even if parts of the network or servers fail unexpectedly. Central to this methodology is that the high-availability application architecture itself needs to encompass reliable data exchange. Implicit in this architecture is that the databases at each active data center need to update each other as client transactions are executed.
The customer's application characteristics were well-suited to an active-active architecture in which either data center could execute a full transaction. Client transactions were sent to a data center application that updated a central database, then sent an acknowledgement to the client endpoint. Because the client buffered each transaction until that acknowledgement arrived, the mechanism guaranteed delivery of the transaction. And because the high-availability application was internally developed, any subsequent modifications could be made in-house.
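To make the update-then-acknowledge order concrete, here is a minimal Python sketch of the server side. It is not the customer's actual code. The SQLite store, the table layout, the function names and the idea of a client-supplied transaction ID are all assumptions made for illustration. The ID is there because a client that retransmits a buffered transaction may deliver it twice, and the server should acknowledge the repeat without applying it a second time.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a (hypothetical) transaction store keyed by a client-supplied ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "txn_id TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def handle_transaction(conn: sqlite3.Connection, txn_id: str, payload: bytes) -> str:
    """Commit the transaction first, then return the acknowledgement.

    A retransmitted transaction (same txn_id) is ignored by the database
    but still acknowledged, so the client can stop buffering it.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR IGNORE INTO transactions (txn_id, payload) VALUES (?, ?)",
            (txn_id, payload),
        )
    # The acknowledgement is only built after the commit has completed.
    return "ACK " + txn_id
```

The important design choice is the ordering: if the server fails before the commit, no acknowledgement is sent and the client still holds the transaction.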
TCP for data delivery?
TCP is the network mechanism engineered to ensure reliable data delivery. But while TCP can retry delivery of a dropped packet, it can't guarantee data delivery when one of the endpoints fails. The TCP session is established between the interfaces of two endpoints. And if one of the endpoints -- a server or its interface -- fails, the TCP session is terminated.
Lessons from unicorn companies
IT systems of so-called unicorn companies, like Facebook, Google, Microsoft, Netflix and Amazon, are designed to allow clients to connect to any of their data centers. If any element within a data center fails, transactions attempting to use that component will automatically be assigned to a different part of the IT infrastructure. Companies like these expect portions of their infrastructure to fail, so they build resilience into the applications themselves.
Resilient architectures for the rest of us
If you don't work for a unicorn company, what can you do? We can learn from the unicorn companies and modify our IT systems to function in a similar manner. This works best for high-availability applications that are built in-house.
For example, a client endpoint could use a transaction retransmit timer with a round-robin list of data center addresses, learned via the domain name system (DNS) -- for example, through global server load balancing. The client would buffer the transaction until it received a confirmation from a reachable data center. Database synchronization would distribute updates to other instances, so any database could handle future transactions. This architecture allows companies to deploy multiple application database systems. And this approach could even extend to access database instances in cloud infrastructures, like Amazon Web Services and Microsoft Azure.
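The client-side logic described here can be sketched in a few lines of Python. Treat it as an illustration under stated assumptions rather than a finished implementation. The class name, the `send` callable and the example host names are hypothetical. `send` stands in for whatever transport the application uses: it is assumed to deliver the transaction to one address, wait up to the retransmit timeout for a confirmation and return True only if one arrives.

```python
import itertools
from typing import Callable, Iterable, Optional

# Hypothetical transport signature: (address, transaction, timeout) -> confirmed?
SendFn = Callable[[str, bytes, float], bool]


class TransactionClient:
    """Buffers a transaction and retries it round-robin until it is confirmed."""

    def __init__(self, data_centers: Iterable[str], send: SendFn,
                 retransmit_timeout: float = 2.0, max_attempts: int = 6):
        addresses = list(data_centers)
        if not addresses:
            raise ValueError("at least one data center address is required")
        self._cycle = itertools.cycle(addresses)  # round-robin iterator
        self._send = send
        self._timeout = retransmit_timeout
        self._max_attempts = max_attempts
        self.buffered: Optional[bytes] = None  # transaction awaiting confirmation

    def submit(self, txn: bytes) -> Optional[str]:
        """Return the address that confirmed txn, or None if every attempt failed."""
        self.buffered = txn
        for _ in range(self._max_attempts):
            address = next(self._cycle)
            try:
                confirmed = self._send(address, txn, self._timeout)
            except OSError:
                confirmed = False  # unreachable data center: same as a timeout
            if confirmed:
                self.buffered = None  # safe to discard only after confirmation
                return address
        return None  # still buffered; the caller decides when to retry
```

In practice, the address list might come from a DNS lookup (Python's `socket.getaddrinfo`, for instance), and a failed `submit` leaves the transaction in `buffered` so it can be retried later.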
Third-party applications -- like an electronic health record application, for example -- are more challenging. We can ask software vendors for resilient system designs capable of operating with active-active data centers. If you examine the client side of the application closely, you may find an opportunity to add a small software module that can monitor data center connectivity. If connectivity fails, the software module can then switch the application to another data center automatically.
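As a rough idea of what such a module might look like, the Python sketch below probes the active data center and switches to another one after several consecutive failures. All names in it are hypothetical, as are the failure threshold and the choice of a plain TCP connection attempt as the health check. How the third-party application is actually repointed (a configuration file, a local proxy, a DNS override) depends on the product and is not shown.

```python
import socket
from typing import Callable, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


class DataCenterSelector:
    """Tracks the active data center and fails over after repeated probe failures."""

    def __init__(self, endpoints: Sequence[Endpoint],
                 probe: Callable[[Endpoint], bool] = tcp_probe,
                 failure_threshold: int = 3):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._probe = probe
        self._threshold = failure_threshold
        self._active = 0    # index of the data center currently in use
        self._failures = 0  # consecutive failed probes of the active one

    @property
    def active(self) -> Endpoint:
        return self._endpoints[self._active]

    def check(self) -> Endpoint:
        """Probe the active data center once and return the one to use next."""
        if self._probe(self.active):
            self._failures = 0
            return self.active
        self._failures += 1
        if self._failures >= self._threshold:
            # Try the remaining data centers in order; take the first reachable one.
            count = len(self._endpoints)
            for offset in range(1, count):
                candidate = (self._active + offset) % count
                if self._probe(self._endpoints[candidate]):
                    self._active = candidate
                    self._failures = 0
                    break
        return self.active
```

A caller would invoke `check()` on a timer and repoint the application whenever the returned endpoint changes. The sketch deliberately does not fail back on its own when the original data center recovers, which avoids flapping between sites.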
Another option is to consider technologies like software-defined WAN, which increases path diversity through the use of multiple links from different providers. This approach will work for third-party applications, as well.
With the widespread adoption of cloud computing, it's tempting to design the system to use one in-house data center and one cloud-based data center.
High-availability application lessons
Unicorn companies offer enterprise networking teams some interesting examples of how to make IT systems and applications highly available. While it may require a bit of innovation to improve applications we don't control, the good news is there are a lot of technologies that can help almost any organization improve its application resilience.