Your network is like an onion -- a series of concentric bottlenecks, located at different components and at different layers, along the end-to-end path. To improve your application performance, you need to identify the "narrowest" bottleneck that defines the end-to-end performance and remove it. And then remove the next one. And the one after that, until finally you achieve optimal network performance.
Your current end-to-end performance on any given network path, for a particular application, is a function of the "narrowest" bottleneck. It may be the Ethernet capacity -- and limited by design. Or it may be some sort of network dysfunction that keeps the path from reaching its optimal functionality -- such as a duplex conflict. Or it may be an unanticipated consequence of very high latency interacting with TCP's slow start and windowing behavior (e.g., the bandwidth-delay product).
Each network path may have a different predominant bottleneck. Or the same bottleneck may be shared by many paths. Regardless, once that bottleneck is identified and removed, the end-to-end performance will increase until it reaches the next bottleneck.
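As a rough illustration of that peeling process, here is a minimal Python sketch. It assumes each stage on the path can be summarized by a single throughput ceiling in Mbps, which is a simplification. The stage names, the numbers, and the `narrowest` and `peel` helpers are all hypothetical, chosen only to show the idea.

```python
# Hypothetical model: each stage on the path has one throughput ceiling (Mbps).
# Names and numbers are illustrative assumptions, not measurements.

def narrowest(stages):
    """Return (name, rate) of the stage with the lowest ceiling."""
    name = min(stages, key=stages.get)
    return name, stages[name]

def peel(stages, upgrades):
    """Lift the narrowest stage again and again; return each bottleneck found.

    `upgrades` maps a stage name to its ceiling after the fix. The loop stops
    when the current narrowest stage has no fix available.
    """
    stages = dict(stages)      # work on copies so the caller's data is untouched
    upgrades = dict(upgrades)
    history = []
    while True:
        name, rate = narrowest(stages)
        history.append((name, rate))
        if name not in upgrades:
            return history
        stages[name] = upgrades.pop(name)

path = {"ethernet_link": 10, "tcp_window": 40, "nic_driver": 70, "disk_io": 300}
fixes = {"ethernet_link": 1000, "tcp_window": 900}

for name, rate in peel(path, fixes):
    print(f"bottleneck: {name} at {rate} Mbps")
# bottleneck: ethernet_link at 10 Mbps
# bottleneck: tcp_window at 40 Mbps
# bottleneck: nic_driver at 70 Mbps
```

Notice that upgrading the link a hundredfold only moves the end-to-end rate from 10 to 40 Mbps in this made-up path, because the next layer of the onion takes over immediately.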
Let's consider a really simple example where the only limiter is the Layer 2 capacity -- suppose it is the 10 Mbps of an Ethernet link that defines how quickly an FTP transfer takes place. So identified, that link can be upgraded to Fast Ethernet, and thus the transfers will jump nearly 10-fold in performance. And then removing that limit might mean going to Gigabit Ethernet.
In the real world, there are many different types of bottlenecks and we know that it is rarely as simple as bandwidth. In fact, bandwidth is almost never the problem although it is most often suggested as the solution to a performance issue. Performance bottlenecks are usually found elsewhere in the path or at another layer.
Here are a few likely causes:
- Deliberate rate-limiting: Cisco's Committed Access Rate (CAR) rate-limiting mechanism limits the maximum rate at which packets can be forwarded in the mid-path. It was one of their earliest rate-limiting mechanisms and is still in wide use today. It assumes a Reno-style TCP stack controls the rate of transmission of a particular stream and starts to drop packets whenever the stream exceeds a specific instantaneous rate. The dropped packets signal congestion to the sender, which should then slow down. Depending on configuration, CAR may drop according to bytes per second or number of packets per second. Talk to your ISP about whether it should be in use in your network.
- Interrupt limiting: At Gigabit Ethernet rates, when relatively small packets are used (much smaller than 1500 bytes), the rate of arrival of Layer 3 packet headers can overwhelm a NIC to the point that it begins to drop packets. This is not overflow due to overall payload but a limit on the processing of the individual headers. Larger packets can carry data at a higher overall rate with no loss at all. Avoid the use of small packets and/or consider jumbo frames.
- NIC buffers: NIC tuning is sometimes required in order to avoid cases where packets overwhelm the send and receive buffers at the NIC. Particularly when attempting to achieve the full 1000 Mbps capacity on Gigabit Ethernet, the buffers must be increased to 3-5 MB. Your mileage may vary but this is the typical GigE bottleneck at Layer 2/3.
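- Bandwidth-Delay product: Even though the capacity is known to be relatively high, you may experience quite low overall throughput over WAN links. This is a consequence of TCP behavior interacting with the available bandwidth at large latencies: TCP can keep only one window of unacknowledged data in flight per round trip, so a small window on a high-latency path caps throughput well below the link capacity. This is a typical Layer 4 bottleneck on very long links. Increasing the TCP buffer (window) size is usually the best solution.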
- NIC and driver: Often a newly installed system ships with down-level (outdated) drivers. In some cases, these can limit how quickly the NIC puts packets on the wire. For example, the NIC may say 100 Mbps but the application can at best see 70 Mbps. Upgrading the driver often solves this problem.
- Disk I/O: If the application using the end-to-end path is data intensive, transferring data to and from the disk may be the limiting factor. In this case, the correct definition of "end-to-end" should include the disk array. Instead of pumping the network full of more bandwidth, the best solution will be to find faster disks and/or a better RAID configuration.
- Dysfunctions: There are endless, unforeseen problems that can plague a network and reduce performance -- duplex conflicts, path MTU (pMTU) issues, cable problems, and route flapping are just a few possibilities. These all need to be identified and resolved -- they just shouldn't exist in a healthy network of any sort.
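Two of the causes above, interrupt limiting and the bandwidth-delay product, lend themselves to back-of-the-envelope arithmetic. The following Python sketch is only an estimate under stated assumptions: the function names are hypothetical, the 100 ms round-trip time and 64 KiB window are illustrative values, and the packet-rate figure ignores Ethernet framing overhead such as the preamble and inter-frame gap.

```python
def bdp_bytes(capacity_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return capacity_bps * rtt_s / 8

def window_limited_bps(window_bytes, rtt_s):
    """Throughput ceiling when one TCP window is sent per round trip."""
    return window_bytes * 8 / rtt_s

def packets_per_second(capacity_bps, packet_bytes):
    """Packet arrival rate at line rate (framing overhead ignored)."""
    return capacity_bps / (packet_bytes * 8)

GIGE = 1e9    # 1000 Mbps link capacity
RTT = 0.1     # assumed 100 ms round trip on a long WAN path

print(f"BDP at GigE, 100 ms RTT: {bdp_bytes(GIGE, RTT):,.0f} bytes")
# BDP at GigE, 100 ms RTT: 12,500,000 bytes

print(f"64 KiB window ceiling:   {window_limited_bps(65536, RTT) / 1e6:.2f} Mbps")
# 64 KiB window ceiling:   5.24 Mbps

print(f"1500-byte packets:       {packets_per_second(GIGE, 1500):,.0f} per second")
print(f"64-byte packets:         {packets_per_second(GIGE, 64):,.0f} per second")
# 1500-byte packets:       83,333 per second
# 64-byte packets:         1,953,125 per second
```

Under these assumptions, a 64 KiB window holds a gigabit path to roughly 5 Mbps regardless of the link speed, and small packets multiply the number of headers the NIC must process by more than twenty. Neither problem is cured by adding bandwidth.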
When you start to think in terms of an onion, you can see that peeling away your bottlenecks one at a time can quickly get you to the level of performance you designed into your network. The obvious problems just aren't the usual suspects. The weakest link in the chain -- the predominant bottleneck -- is the most important thing to find and remove, and it may not be where you think it is.
Think onions -- the jump in network performance will bring tears to your eyes…
Chief Scientist for Apparent Networks, Loki Jorgenson, PhD, has been active in computation, physics and mathematics, scientific visualization, and simulation for over 18 years. Trained in computational physics at Queen's and McGill universities, he has published in areas as diverse as philosophy, graphics, educational technologies, statistical mechanics, logic and number theory. Also, he acts as Adjunct Professor of Mathematics at Simon Fraser University where he co-founded the Center for Experimental and Constructive Mathematics (CECM). He has headed research in numerous academic projects from high-performance computing to digital publishing, working closely with private sector partners and government. At Apparent Networks Inc., Jorgenson leads network research in high performance, wireless, VoIP and other application performance, typically through practical collaboration with academic organizations and other thought leaders such as BCnet, Texas A&M, CANARIE, and Internet2. www.apparentnetworks.com