Anyone who has ever studied or analyzed event logs knows that a single fault can generate a great many events. It's not uncommon for a wrong setting to fill an event log quickly, which by itself can cause problems if there isn't an administrator nearby or on call to attend to it. The trouble is that many events don't really require immediate resolution, while others do. Therefore, it is good practice to define and implement a set of policies for alert notification and action, and then for escalation when the action doesn't resolve the problem within the intended time.
It would be tedious to try to define all events that require resolution, and even more tedious to create a policy to implement your intended action for each one. Few companies have the resources to create this type of software, and most turn to commercial solutions instead. One example of this kind of software is SiteScope, which monitors a number of items such as application servers, page download times, errors (a measure of network failures), available network bandwidth, and custom monitors you define. SiteScope doesn't use an agent, although many applications that perform a monitoring and alert function do. When a problem is detected, SiteScope alerts staff by pager, cell phone, or e-mail. Scripted recovery is also possible. Here SiteScope is measuring the actual problem, which is somewhat easier than event analysis.
Any approach to alert escalation starts with considering the level of threat the problem creates, and the range of actions that are required. For each range of actions, the escalation policy should start with the action requiring the least resources and progress to more costly resources over time. For example, suppose a critical network failure is detected that threatens to take down your network backbone. An alert is sent to your junior staff member. A certain number of monitor events for this problem are allowed to occur before another alert is sent, not only to the junior staff member but to their superior as well. The situation continues to escalate until perhaps an outside source with 24x7 coverage is called in to solve the problem, which is the most costly fix.
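The escalation ladder described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SiteScope or any other product works: the class names, the event thresholds (1, 5, and 10), and the recipient names are all hypothetical, and a real policy would more likely escalate on elapsed time as well as on event counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    threshold: int     # monitor events seen before this level is triggered
    recipients: tuple  # everyone who is alerted at this level


# Hypothetical policy, ordered from least costly to most costly response.
POLICY = (
    EscalationLevel(1, ("junior_admin",)),
    EscalationLevel(5, ("junior_admin", "supervisor")),
    EscalationLevel(10, ("junior_admin", "supervisor", "outside_24x7_service")),
)


class Escalator:
    """Tracks one open problem and decides when to alert the next level."""

    def __init__(self, policy):
        self.policy = sorted(policy, key=lambda level: level.threshold)
        self.event_count = 0
        self.next_level = 0  # index of the next level not yet triggered

    def record_event(self):
        """Count one monitor event; return who to alert now (empty tuple if nobody)."""
        self.event_count += 1
        recipients = ()
        # Trigger every level whose threshold has now been reached.
        while (self.next_level < len(self.policy)
               and self.event_count >= self.policy[self.next_level].threshold):
            recipients = self.policy[self.next_level].recipients
            self.next_level += 1
        return recipients

    def resolve(self):
        """Reset the ladder once the problem has been fixed."""
        self.event_count = 0
        self.next_level = 0


if __name__ == "__main__":
    escalator = Escalator(POLICY)
    for n in range(1, 11):
        alerted = escalator.record_event()
        if alerted:
            print(f"event {n}: alert {', '.join(alerted)}")
```

Each level is triggered once, so a flood of events from a single fault produces three alerts rather than one alert per event, and the most costly resource is contacted only after the cheaper ones have had a chance to respond.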
Barrie Sosinsky is president of consulting company Sosinsky and Associates (Medfield MA). He has written extensively on a variety of computer topics. His company specializes in custom software (database and Web related), training and technical documentation.
This was first published in May 2003