When rolling out applications, there are a lot of tools that will help you understand what is required for applications to perform well. From what I can gather you have NetScout Probes on the network. Do you also export NetFlow data from the Cisco devices? There are several reporting tools on the market that would allow you to get visibility into the applications, hosts, and conversations flowing over the links.
From what I gather, you have little to no insight into the delays produced by the application or its architecture, or into how much throughput it needs to transfer data. To answer your questions quantitatively, I would normally ask a number of questions about the application itself and its architecture. I often find that distributed applications are never tested over a WAN or for users with higher latency, so there are various reasons why they take longer to respond once latency is introduced. In your situation, I would advise picking up a packet analyzer and learning to analyze throughput and delays. Without in-depth knowledge of the application, I will outline a few probable reasons why it is slower for users even though bandwidth does not appear to be the problem.
First, realize that serialization delay and queuing delay are the two components of network delay that are improved by increasing bandwidth. Serialization delay (the time it takes to put the data on the wire) and queuing delay (the time data waits in the queue, which depends on queue depth) both improve when you upgrade from a 128 Kbps circuit to a T1. However, the three other components of delay, routing/switching delay, distance delay, and protocol delay, cannot be improved by an increase in bandwidth. If the circuits are not over-utilized, then increasing the bandwidth will quite possibly give you more bandwidth with no positive effect on the performance of this application. Validating this point would require data from either the NetScout probes or NetFlow.
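To put rough numbers on the serialization component, here is a minimal sketch in Python. The 1500-byte packet size is my assumption for illustration, and the T1 is taken at its 1.544 Mbps line rate; the function name is hypothetical.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000.0


PACKET_BYTES = 1500      # assumed packet size, not taken from the question
LINK_128K = 128_000      # 128 Kbps circuit
LINK_T1 = 1_544_000      # T1 line rate

for name, bps in (("128 Kbps", LINK_128K), ("T1", LINK_T1)):
    delay = serialization_delay_ms(PACKET_BYTES, bps)
    print(f"{name}: {delay:.2f} ms to serialize one {PACKET_BYTES}-byte packet")
```

Under these assumptions the upgrade removes tens of milliseconds per packet, but it does nothing for distance or protocol delay, which is why a faster circuit can leave a chatty application just as slow.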
Alternatively, I would recommend capturing the packets through a port mirror on the server switches. There are a lot of great tools that facilitate this type of analysis and reporting. Like peeling an onion, working out why this application behaves slowly on the network, and which component is causing the delay, takes a little work.
The first thing that I look at typically is how long the transactions are taking. Through the diagnostics available in most packet analyzers, I will look at the following from a single conversation standpoint:
- How long does it take for the server to start sending data across once it receives a request?
- What's the network latency? (How long does it take for the ACK to come back for a particular transfer?)
- How many turns does the application take to transfer data?
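Most packet analyzers report these three figures for you, but the idea behind them can be sketched in a few lines of Python. Everything here is hypothetical: the `Packet` record, the function names, and the sample trace are made up for illustration, and the sketch assumes a single conversation captured near the client, so the "server response time" it reports still includes one network round trip.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    time: float    # capture timestamp in seconds
    sender: str    # "client" or "server"
    payload: int   # bytes of application data (0 for a bare ACK)


def first_request(packets):
    """Return the first client packet that carries application data."""
    return next(p for p in packets if p.sender == "client" and p.payload > 0)


def server_response_time(packets):
    """Seconds from the first request to the first server data packet."""
    request = first_request(packets)
    reply = next(p for p in packets
                 if p.sender == "server" and p.payload > 0 and p.time > request.time)
    return reply.time - request.time


def ack_latency(packets):
    """Seconds from the first request to the first server packet of any kind."""
    request = first_request(packets)
    ack = next(p for p in packets if p.sender == "server" and p.time > request.time)
    return ack.time - request.time


def application_turns(packets):
    """Count turns: each new run of client data packets starts a turn."""
    turns = 0
    previous_sender = None
    for p in packets:
        if p.payload == 0:
            continue  # bare ACKs do not count as application data
        if p.sender == "client" and previous_sender != "client":
            turns += 1
        previous_sender = p.sender
    return turns


# Made-up trace of one conversation, for illustration only.
trace = [
    Packet(0.000, "client", 300),    # request 1
    Packet(0.040, "server", 0),      # ACK for request 1
    Packet(0.120, "server", 1460),   # first response data
    Packet(0.121, "server", 900),
    Packet(0.160, "client", 0),      # client ACK
    Packet(0.165, "client", 280),    # request 2
    Packet(0.205, "server", 0),
    Packet(0.230, "server", 1200),
]

print(f"server response time: {server_response_time(trace) * 1000:.0f} ms")
print(f"ACK latency:          {ack_latency(trace) * 1000:.0f} ms")
print(f"application turns:    {application_turns(trace)}")
```

A large gap between the ACK latency and the server response time points at the server or application; a large ACK latency on its own points at the network.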
With those three questions in mind, I have diagnosed numerous application issues. There are many possible causes, but I will outline a few that I have witnessed by explaining why I ask each of the three questions above.
First, I ask about the server response time to understand how long it takes for the server to start sending data to the client. Does the application wait for other application data before it sends the response? For example, I once had a case where the Web server waited for an entire query to complete before it started sending GIFs, supporting images, and table views to the client. After a few options were turned on inside IIS, the server started sending the rest of the Web site to the client while the query was still processing, so from the end users' perspective it took less time to load the page and process orders.
Second, network latency is where I would look for congestion issues on the network. The things to look for here are window resizing, lost packets, and retransmissions. Are connections timing out? Investigating the latency introduced would require understanding the location of the client and following the routes and congestion statistics for each hop. In this vein, I have been able to diagnose an issue where a switch's auto-negotiation had selected a 10 Mbps half-duplex connection instead of a 100 Mbps full-duplex one, causing a big bottleneck once traffic reached a particular server NIC.
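As a rough illustration of what "looking for retransmissions" means, the sketch below counts data segments whose starting sequence number has already been seen in one direction of a conversation. This is a deliberate simplification and not how any particular analyzer does it: it assumes a retransmission reuses the original starting sequence number, and the function name and sample data are hypothetical.

```python
def count_retransmissions(segments):
    """Count data segments whose starting sequence number was already seen.

    segments: iterable of (seq, length) tuples for one direction of one
    TCP conversation, in capture order.
    """
    seen = set()
    retransmitted = 0
    for seq, length in segments:
        if length == 0:
            continue  # bare ACKs carry no data and are ignored
        if seq in seen:
            retransmitted += 1
        else:
            seen.add(seq)
    return retransmitted


# Made-up capture: the segment starting at 1461 is sent twice.
segments = [(1, 1460), (1461, 1460), (2921, 1460), (1461, 1460), (4381, 1000)]

retrans = count_retransmissions(segments)
data_segments = sum(1 for _, length in segments if length > 0)
print(f"{retrans} of {data_segments} data segments were retransmissions")
```

A persistently high share of retransmissions toward one host is the kind of symptom that led me to the duplex mismatch described above.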
Understanding application turns is also an opportunity to improve the overall performance of the application without increasing bandwidth. The number of turns, and whether the packets are fully loaded so that the TCP window is used efficiently, determines how well the application's throughput holds up on higher-latency links. For example, an application that requires 10 turns on a 10 ms network would complete the transaction in 100 ms, while the same transaction on a 250 ms network would take 2.5 seconds. By looking at the number of turns required, you get a better understanding of how "chatty" the transactions are.
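The arithmetic is simple enough to sketch in Python. The first function reproduces the turns example; the second adds the standard window-over-round-trip-time ceiling on throughput for a single TCP connection. The 65,535-byte window is my assumption (the largest window without window scaling), and both function names are hypothetical.

```python
def turn_time_ms(turns: int, latency_ms: float) -> float:
    """Time spent waiting on the network when each turn costs one latency period."""
    return turns * latency_ms


def window_limited_throughput_bps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one connection's throughput: one full window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0)


WINDOW_BYTES = 65_535    # assumed TCP window, no window scaling

for latency_ms in (10, 250):
    wait = turn_time_ms(10, latency_ms)
    ceiling_mbps = window_limited_throughput_bps(WINDOW_BYTES, latency_ms) / 1_000_000
    print(f"{latency_ms} ms latency: 10 turns take {wait:.0f} ms, "
          f"window-limited ceiling about {ceiling_mbps:.1f} Mbps")
```

Neither number changes when you add bandwidth; they improve only when the application takes fewer turns or keeps more data in flight per round trip.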
There are numerous other avenues to pursue to better understand the requirements for an application. If you continue to have questions, please let me know. I can provide a few other options as well.
This was first published in November 2006