The network is the new backplane of a multi-server system; a single transaction often requires the close cooperation of multiple servers, databases and appliances connected to one another by network paths. Each of these devices is frequently updated with new software, and the network itself is always changing, with new paths, new switching devices and new configurations. It's therefore almost impossible to reproduce a modern multi-server, network-based production system in a test lab. Unfortunately, it's usually the subtleties of the production environment that cause problems in production applications.
Systems managers must therefore expect that programmers will appear in their operations center. Those programmers want to trace an individual transaction through a maze of equipment and network paths, but often the tools available in the production environment provide only summary performance metrics and, possibly, some cryptic protocol traces from a few LAN segments. They must either guess at what is happening to individual transactions by looking at the summary data or spend hours trying to find and match up pieces of the transaction flow captured by protocol tracing utilities -- an often fruitless exercise, and one that requires protocol knowledge that programmers often don't have.
Dye tracing, which can follow an individual transaction through multiple servers, gives programmers a familiar diagnostic environment. When comprehensive dye-tracing facilities have been installed, programmers can watch each program call, its timings, and its parameters, as if the entire process were contained within a single server. They can watch as a synthetic (test) transaction arrives at a server, progresses through the applications software and the database calls, and then generates a response; they can watch individual customer transactions that are having problems to see where inside the applications and the servers the difficulty or bottleneck is occurring.
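To make the idea concrete, here is a minimal sketch of how a tracer can record each program call, its timing, and its parameters. This is illustrative only: the `traced` decorator, the `TRACE_LOG` list, and the `run_query` function are hypothetical stand-ins, not part of any actual dye-tracing product, which would intercept calls below the application rather than via decorators.

```python
import functools
import time

TRACE_LOG = []  # illustrative; a real tracer would stream records to a collector


def traced(func):
    """Record each call's name, parameters, and elapsed time --
    roughly what a dye-tracing shim captures for instrumented calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            TRACE_LOG.append({
                "call": func.__name__,
                "params": (args, kwargs),
                "elapsed_s": time.perf_counter() - start,
            })
    return wrapper


@traced
def run_query(sql):
    # Hypothetical stand-in for a database call; a shim would
    # intercept the real database driver instead.
    return "rows for: " + sql


run_query("SELECT * FROM orders WHERE id = 42")
```

With records like these from every server a transaction touches, a viewer can replay the call sequence as if it all happened in one process.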
Dye tracers cost money; they require software to be inserted into production servers, which adds complexity. There is also a small performance penalty. For those reasons, some production organizations resist their use. But think of the time and effort saved when there's a problem! Programmers intuitively understand a dye tracer; it's similar to the tools they use in development. They can find a problem quickly, with much less involvement of the network operations staff and without the need to obtain protocol traces quickly during a crisis -- which may itself cause problems. The development group may even be willing to share the cost of the dye-tracing system, and they may want to use it during development to tune their applications -- which will also make them familiar with it when a crisis occurs.
A dye-tracing system works by inserting a software shim between the application programs and the underlying operating system. In some dye tracers, that shim watches all of the program calls, records their response times, and may also copy some of the call parameters, such as SQL query text. To trace a transaction from one server to another, some dye-tracing shims insert a tracking number into inter-processor messages. The shims on the other processors use that tracking number to correlate the transaction's hops, then remove it from the messages before they're seen by the application program. The entire dye-tracing system is completely transparent to the application.
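The tracking-number mechanism can be sketched in a few lines. The header name `X-Dye-Trace-Id` and the dict-based message format are assumptions for illustration; real shims tag whatever wire format the servers actually exchange.

```python
import uuid

TRACE_HEADER = "X-Dye-Trace-Id"  # hypothetical field name, for illustration


def inject(message, trace_id=None):
    """Outgoing shim: stamp the inter-processor message with a tracking number."""
    tagged = dict(message)
    tagged[TRACE_HEADER] = trace_id or uuid.uuid4().hex
    return tagged


def extract(message):
    """Incoming shim: pull out the tracking number and strip it,
    so the application program never sees it."""
    clean = dict(message)
    trace_id = clean.pop(TRACE_HEADER, None)
    return clean, trace_id


# Server A's shim tags a message on the way out...
outbound = inject({"op": "debit", "account": "1234", "amount": 50})
# ...and server B's shim correlates on the id, then hands the
# application an untagged copy, keeping the tracing transparent.
clean, tid = extract(outbound)
```

Because `extract` returns the message exactly as the application sent it, neither side's code needs any changes, which is the transparency property described above.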
Dye tracers impose a slight load on the server systems, typically a few percent. To reduce that load further, the dye-tracing system can be restricted to a few servers in a load-distributed environment, to a small percentage of transactions, or to synthetic measurement transactions only.
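A typical sampling decision might look like the following sketch. The 5% rate, the `synthetic` flag, and the transaction-as-dict shape are all assumptions chosen for illustration.

```python
import random

SAMPLE_RATE = 0.05  # illustrative: dye-trace roughly 5% of customer traffic


def should_trace(transaction):
    """Decide whether this transaction gets dye-traced.

    Synthetic measurement transactions are always traced; ordinary
    customer transactions are sampled to keep the overhead low.
    """
    if transaction.get("synthetic"):
        return True
    return random.random() < SAMPLE_RATE
```

Sampling keeps the steady-state cost small while still catching enough real transactions to diagnose recurring problems; the rate can be raised temporarily during a crisis.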
Dye tracers are available from a growing number of suppliers. Examples of these tools, all with different capabilities, are CA's Introscope Transaction Tracer, HP's Transaction Analyzer, IBM Tivoli's Composite Applications Manager, OpTier's CoreFirst, Quest Software's PerformaSure, and Symphoniq's TrueVue.
About the author: Eric Siegel is a senior analyst at the Burton Group. He is a nationally known authority on Web performance measurement and optimization. He has 32 years of experience in design and evaluation of large computer networks and is the author of major portions of Burton Group's original Reference Architecture. At Burton, Eric specializes in Web and network performance optimization, SLAs, network measurement and management, and QoS.