For our methodology, we use Concord Network Health reporting. Concord takes 5-minute polling samples of an "element", which in our case is a core interface, i.e., the hub side of a WAN circuit. The "anchor" information is the port speed. The port speed (128K) is compared against the observed bps at the time the sample is taken, and a utilization % is derived. We calculate both input and output because of the vast differences in traffic to and from the core. We focus mostly on utilization of traffic back to a remote location because of the nature of our applications: large report requests, Web downloads, etc. drive utilization higher.
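A minimal sketch of that derivation, assuming made-up sample values (the 128K port speed is the one mentioned above; the bps figures are hypothetical):

```python
PORT_SPEED_BPS = 128_000  # the "anchor": port speed of the WAN circuit

def utilization_pct(bps, port_speed=PORT_SPEED_BPS):
    """Derive utilization % by comparing observed bps to port speed."""
    return 100.0 * bps / port_speed

# Input and output are tracked separately because traffic to and from
# the core differs so much; these 5-minute output samples are made up.
samples_out_bps = [41_000, 96_500, 122_000, 58_300]
for bps in samples_out_bps:
    print(f"{bps:>7} bps -> {utilization_pct(bps):5.1f}% utilization")
```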
We arrived at the current comparison metrics by "guessing" at the threshold where application performance or response time changes. Since we are a service provider, we wanted to use our time efficiently and take a 5,000-foot view first. This method lets us dig deeper when normal traffic-flow percentages are consistently high, or move on to other possible causes of poor application performance.
A base of <38% seemed like a safe threshold because of the 10-hour time frame: the morning hours could be highly utilized while the afternoon hours may be lower. Rather than spending on upgrades, perhaps scheduling changes could help keep costs down. The other percentages simply step up from there. The flaw in this thinking is if the morning hours do in fact show very high utilization and normal business practices cannot change.
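The flaw is easy to illustrate with hypothetical hourly figures: a 10-hour average can clear the <38% base even when the morning hours are congested.

```python
# Hypothetical hourly utilization percentages over a 10-hour business day.
morning   = [65, 70, 68, 62, 66]  # first 5 hours: heavily utilized
afternoon = [5, 6, 4, 7, 5]       # last 5 hours: nearly idle

day = morning + afternoon
print(f"10-hour average: {sum(day) / len(day):.1f}%")          # 35.8% -- under the 38% base
print(f"morning average: {sum(morning) / len(morning):.1f}%")  # 66.2% -- congested
```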
What is important to us is that application response times are reasonable. In my experience there are many causes of poor application performance. Our goal in measuring WAN utilization is to eliminate the WAN as a factor in poor app performance, so that we can focus on client, host processing, or application engineering flaws (I know - spoken like a true network person).
Thanks again for your time.
More from Dr. Jorgenson
Your methodology makes good sense. For file transfers you should be able to develop reasonable response criteria. It is hard for me to comment on your specific numbers, though; I would have to get deep into the details with you.
I would make one additional comment, though. I would not look only at averages over long periods (like hours); I would look at the distribution of local maxima and minima. Alternatively, identify local maxima (minima) and measure the percentage of time above (below) some selected thresholds.
What does that get you? If you pick a reasonable threshold for the maximum and a tolerance/response time for the application, you will be able to establish a metric for network response: the rate at which it fails to respond at an acceptable level. That measure may be quite different from a global average, for the reasons I mentioned in my first response (burst traffic). It is the rate of occurrence of the instantaneous utilization exceeding an acceptable threshold; in other words, how often an application experienced a WAN link that was not performing within specification.
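A sketch of that metric, assuming hypothetical sample values and a hypothetical 70% acceptability threshold:

```python
def exceedance_rate(util_pcts, threshold_pct):
    """Fraction of polling samples whose utilization exceeds the threshold,
    i.e. how often the link was not performing within specification."""
    over = sum(1 for u in util_pcts if u > threshold_pct)
    return over / len(util_pcts)

# Bursty traffic: the global average is modest, yet the link was over
# the 70% threshold in 3 of 10 samples.
samples = [5, 10, 95, 8, 90, 6, 12, 9, 88, 7]
print(f"global average : {sum(samples) / len(samples):.1f}%")  # 33.0%
print(f"exceedance rate: {exceedance_rate(samples, 70):.0%}")  # 30%
```

The contrast is the point: an hourly average of 33% would pass the base threshold, while the exceedance rate shows the link out of spec nearly a third of the time.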
Here's where a graph could be really handy.
I trust that you get my meaning and that this helps.