The question of "what to monitor" is partly a matter of philosophy and partly a matter of what tools you have available to do the monitoring.
For example, if you take the end-to-end principle to heart (which I would urge you to understand well even if you don't apply it religiously), then you have to define "end-to-end" for your purposes. If you are primarily offering Web services and your customers/users are accessing the services via browser, then end-to-end extends from the user's desktop, starting at the application (the browser) and including their OS, NIC driver, LAN, and ISP connection, all the way to your Web server, and then back through all the tiers of your Web system, quite likely all the way to your backend database, where even disk I/O is important.
Some of this path you can monitor. Some of it you can't. The more you can see, the better. However, you may wish to declare some reasonable boundaries that you are not going to cross. For example, you may wish to disavow any responsibility for the user's desktop. But when you establish the boundaries, you must at least be able to distinguish which side of the boundary a problem lies on.
A useful demarcation is between the network (say Layer 3 and below) and the application (Layers 4-7). Another useful distinction is between factors (hardware, OS, peripherals, applications) on a specific host versus anything else (such as any other host connecting to it and the network itself). Yet another is between the system on a specific host and an application it is hosting.
Making these useful distinctions requires that you have the means to do so. For example, sniffing packets at your gateway/firewall can be helpful in distinguishing the behaviors of different applications. Recording CPU cycles spent on different application tasks on a particular host can isolate the effects of disk latency, bus timeouts, or page faults. And actively probing from end-to-end at Layer 3 can separate out the effects of the application/host from the network connection.
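To make the last of those distinctions concrete, here is a minimal sketch in Python. It is not a true Layer 3 probe: sending ICMP normally needs raw sockets and elevated privileges, so this stand-in times the TCP handshake as a rough estimate of one network round trip and then times the wait for the first byte of an HTTP reply. The function names `probe` and `attribute_delay`, the default port, and the use of a plain HTTP/1.0 request are all assumptions made for illustration.

```python
import socket
import time


def probe(host, port=80, path="/", timeout=5.0):
    """Return (connect_seconds, first_byte_seconds) for one plain HTTP request.

    connect_seconds approximates one network round trip (the TCP handshake).
    If host is a name rather than an IP address, it also includes the DNS lookup.
    first_byte_seconds runs from sending the request to the first byte of the
    reply, so it covers roughly one round trip plus the server's own work.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.perf_counter()
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        sock.sendall(request)
        sock.recv(1)  # block until the first byte of the response arrives
        first_byte = time.perf_counter()
    return connected - start, first_byte - connected


def attribute_delay(connect_s, first_byte_s):
    """Split the first-byte wait into a network share and a host/application share.

    Assumption: the request/response exchange costs about one round trip,
    so whatever remains after subtracting connect_s is attributed to the host.
    """
    network_s = min(connect_s, first_byte_s)
    host_app_s = max(first_byte_s - connect_s, 0.0)
    return {"network_s": network_s, "host_app_s": host_app_s}
```

Calling `attribute_delay(*probe("www.example.com"))` at regular intervals gives two numbers per sample. If the network share climbs while the host share stays flat, the problem is more likely on the network side of the boundary, and vice versa. A single sample proves little, so treat this as a way to spot trends rather than a diagnosis.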
So what is your mission-critical definition of end-to-end? And what tools do you need to monitor the important features of that path?
Without knowing your network world personally, the obvious starting points are to monitor
These are the elements that are common to almost anyone with a network. Mileage after that varies depending on your applications, your network configuration, and your end-users.
I hope that I haven't just answered your fair question with more questions.
This was first published in January 2004