As enterprises get their arms around working with hybrid cloud architectures, the issue of managing individual cloud applications in a hybrid resource pool quickly comes to the forefront.
Cloud applications decouple application components from specific server and storage resources to create a "pool" that can be employed to maximize utilization and reduce costs. A company can contribute its own servers to the pool to create a private cloud, and it can also use one or more public cloud providers to host non-mission-critical applications or act as an overflow or failover resource in case of problems with its own data centers.
Managing and monitoring hybrid cloud resource pools to make sure that worker productivity isn’t impacted by performance issues is a three-step process:
- Set a measurable Quality of Experience (QoE) goal.
- Organize your monitoring/management resources to fully isolate problems.
- Target remedial action at the real problem source, even in a virtual/cloud world.
The most important step in hybrid cloud management and monitoring is measuring QoE response time at the user's point of connection. This can be obtained from the local application component, client device or local network connection. What is important is the ability to measure total application response time and packet-loss rate. These pieces of information will be used in conjunction with data from other monitoring sources to fix problems.
Where users access applications on smart devices, management tools on the client device will probably be able to provide response-time data. At best, the device has mobile device management (MDM) features that IT can draw on; at worst, the application on the device can include response-time data in the message flow it sends to the application.
If that doesn't work, it’s always possible to measure response time with monitors or probes. No matter how you get the data, the key point is that if you know the acceptable response times and the point at which operations become unacceptable, you have a baseline for resource management and monitoring.
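As a minimal sketch of such a baseline, the Python below classifies a batch of measured response times against an acceptable threshold and an unacceptable threshold. The names `QoEGoal`, `p95` and `classify`, the choice of the 95th percentile as the summary statistic, and the threshold values in the usage note are all assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class QoEGoal:
    """Hypothetical QoE goal: thresholds are set by the business, not by this code."""
    acceptable_ms: float    # response time users consider normal
    unacceptable_ms: float  # point at which operations become unacceptable


def p95(samples_ms):
    """Return the 95th-percentile response time (needs at least two samples)."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20, method="inclusive")[-1]


def classify(samples_ms, goal):
    """Compare the 95th-percentile response time with the QoE goal."""
    value = p95(samples_ms)
    if value <= goal.acceptable_ms:
        return "ok"
    if value < goal.unacceptable_ms:
        return "degraded"
    return "unacceptable"
```

For example, with `QoEGoal(acceptable_ms=200, unacceptable_ms=500)`, twenty samples of 300 ms would be classified as "degraded", which is the kind of signal that would trigger the isolation steps described later.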
Taking inventory of available management data
The second step in QoE management/monitoring is to take inventory of the management data available from network and cloud providers. As before, the objective is to see what delay and packet loss information is available and what resources or network connections the data represents. Expect considerable variation here among providers, even in terms of how the data is reported and interpreted, so be prepared to do some work reducing all the information to common metrics. There is no general solution because every network and cloud provider will have somewhat different information and formats, so there’s no option but to do a bit of customization to create the needed data elements.
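The customization described here might look something like the following Python sketch, which reduces records from two providers to one common metric. Both provider record formats (field names such as `latency_s`, `loss_ratio`, `pkts_sent`) are invented for illustration; real providers will report different fields and units, which is exactly why an adapter per provider is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Common format: one delay and one loss figure per monitored segment."""
    source: str      # which provider reported the data
    segment: str     # resource or network connection the data represents
    delay_ms: float  # delay in milliseconds
    loss_pct: float  # packet loss as a percentage


def from_provider_a(record):
    """Hypothetical provider A: latency in seconds, loss as a 0..1 ratio."""
    return Metric(
        source="provider_a",
        segment=record["link"],
        delay_ms=record["latency_s"] * 1000.0,
        loss_pct=record["loss_ratio"] * 100.0,
    )


def from_provider_b(record):
    """Hypothetical provider B: delay in milliseconds, raw packet counts."""
    sent = record["pkts_sent"]
    loss_pct = 100.0 * record["pkts_lost"] / sent if sent else 0.0
    return Metric(
        source="provider_b",
        segment=record["path"],
        delay_ms=float(record["delay_ms"]),
        loss_pct=loss_pct,
    )
```

Once every source is mapped to the same `Metric` shape, delay and loss figures from different providers can be compared segment by segment.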
On the network side, it's good to assess whether standard network tools like ping and traceroute will work. Both tools provide basic reachability and response-time data, and traceroute adds hop data, for the path between two points. Looking into the network from the user connection point to the application, they can help find network delays or unusual packet routings that can indicate an internal network problem. In order for these to work, however, they have to be supported both at the application side of the connection and in the network connection itself. Things like Network Address Translation (NAT) and load balancing can affect the value of ping/traceroute results, so testing this during a project pilot is important.
Available management data from cloud providers
On the cloud side, cloud providers’ management data varies depending on the type of X as a service offered. Many Platform as a Service (PaaS) and Software as a Service (SaaS) providers will be able to offer some data on the cloud-to-network interface and on internal application resources because the applications are using cloud operating system and middleware components that often have management interfaces. With Infrastructure as a Service (IaaS), the application’s software platform is provided as part of the user’s machine image, which means management tools must be built into the machine image to be available.
Application monitoring and management, and even some network management tools, can often be incorporated into application middleware and deployed on IaaS services to gain better visibility and control. But it’s important to check with the cloud provider to ensure that the tools will work on virtual resources. In many cases, simple applications to echo packets to measure response time can also be added, if ping/traceroute isn’t satisfactory or supported. Unfortunately, there are no real standards for cloud management, even to the extent of defining what information is available. IT has to pick tools based on their familiarity and needs, which generally means any tool that can be integrated with the application image and uses available APIs (on PaaS) will work.
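A simple echo application of the kind mentioned above could be sketched as follows, using only Python's standard `socket` module. The function names `echo_once` and `measure_rtt_ms`, the payload and the probe count are illustrative assumptions; a real deployment would run the echo side inside the machine image and connect to it over the actual network path.

```python
import socket
import threading
import time


def echo_once(conn):
    """Echo side: send back whatever arrives until the peer closes."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def measure_rtt_ms(sock, payload=b"probe", count=5):
    """Probe side: send the payload `count` times and time each echo."""
    rtts = []
    for _ in range(count):
        start = time.perf_counter()
        sock.sendall(payload)
        received = b""
        # Read until the full payload has come back.
        while len(received) < len(payload):
            chunk = sock.recv(len(payload) - len(received))
            if not chunk:
                raise ConnectionError("echo peer closed the connection")
            received += chunk
        rtts.append((time.perf_counter() - start) * 1000.0)
    return rtts
```

To try it locally, `socket.socketpair()` gives two connected sockets: run `echo_once` on one in a `threading.Thread` and pass the other to `measure_rtt_ms`. Measured this way, the figures include processing time at the echo side, so they are closer to application response time than to raw network delay.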
When all of the management data is assembled and converted into a common format, fault isolation and remediation normally begin when user response times rise by more than a predetermined amount. The first step is to determine whether there is unusual packet loss, since lost packets not only have to be retransmitted but also often reset the flow control protocols and thus may reduce connection performance. Packet loss is most easily detected by looking for retransmissions or flow-control changes, either in the application’s network middleware or in the client device. If losses can be eliminated as a cause, the next step is to look for network delay, followed by processing delay.
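The order of checks in this paragraph can be written down as a short decision procedure. The Python below is a sketch under stated assumptions: the `Snapshot` fields, the `isolate` function and all three margin values are hypothetical, and each shop would set its own based on its QoE goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Measurements in the common format, at baseline or at the current time."""
    response_ms: float       # user-side application response time
    retransmit_pct: float    # retransmissions, used as a proxy for packet loss
    network_delay_ms: float  # delay attributed to the network path


def isolate(now, base, response_margin_ms=200.0,
            loss_margin_pct=1.0, delay_margin_ms=50.0):
    """Return the most likely problem area, checked in the article's order."""
    # Isolation only starts once response time rises past the set margin.
    if now.response_ms - base.response_ms <= response_margin_ms:
        return "no action"
    # First: unusual packet loss, seen here as a rise in retransmissions.
    if now.retransmit_pct - base.retransmit_pct > loss_margin_pct:
        return "packet loss"
    # Second: network delay.
    if now.network_delay_ms - base.network_delay_ms > delay_margin_ms:
        return "network delay"
    # What remains is processing time in the server, storage or application.
    return "processing delay"
```

For instance, comparing a baseline of `Snapshot(300, 0.1, 40)` with a current `Snapshot(900, 0.2, 45)` returns "processing delay", because response time has risen sharply while neither loss nor network delay has moved past its margin.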
Packet loss is usually a result of congestion en route, so fixing packet loss will often involve either rerouting packets or increasing network performance. In either case, it will be necessary to work with the network provider(s) to resolve the problem. Packet delays in the network are usually associated with an excess of “hops” between routers along the path, a sign of an inefficient route. If packet routing changes because of a network problem, it will likely be restored to normal in time, but persistent problems with excessive route hops may indicate the provider isn’t able to provide an efficient connection to cloud resources. With VPN services, it’s often possible to reroute VPN connections to reduce hops, but with Internet services, the only option may be to change ISPs.
When neither packet loss nor packet routing delay is at fault, the only remaining variable is processing time, which can be impacted by the loading on the server used to run the application, the storage used, and the application design. Issues with cloud application performance can be traced to colliding resource requirements from other cloud users, inadequate resources allocated to the application in the cloud contract, failure of the provider to meet the service level agreement (SLA), or simply an excess of demand on the application. It may be possible to tune cloud resource allocation through the cloud management interface or to launch multiple instances of the application to improve response time. This will mean adding a form of load balancing to the application, something best handled by the cloud provider or cloud-hosted software.
Professionals in the IT and network operations areas understand problem isolation and resolution processes where resources are dedicated to applications. With some care, those same principles can support hybrid resource pools and ensure application QoE in the cloud.
Tom Nolle is a strategic egghead -- someone who first wants to know the truth, no matter what it is, and then wants to explain it in a way that reaches everyone who cares to know it. He's an analyst in telecommunications, media and technology, and a former software architect who now works to blend technology detail and business reality.