Network error events can be as simple as a printer running out of paper or a network connection failing, or as complex as a network service failure. Depending on the kind of failure, your log files will start to accumulate multiple, and often increasing, numbers of events related to it. The first part of an event escalation policy is the ability to identify the causal event in a cascading, expanding tree of events so that you can reach the root problem. This isn't as easy as it might sound.
Consider the problem of a downed network connection. If the connection was to the print server, it might generate a print error. The printer is fine, but you get a communication error instead. If the printer were simply out of paper, that would be easy to understand, but a communications error is more ambiguous. Solving escalating events requires a rule-based set of policies that logically addresses the relationships between events. Thus you might write a rule so that when a connection error occurs on a specific system, events from all other services supplied by that system are ignored (or suppressed) until that problem is isolated.
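As a rough illustration of such a suppression rule, the sketch below holds back every other event from a system once a connection error has been seen for it. The `Event` class, the event kind names and the host names are hypothetical, chosen only for this example; real escalation products express rules in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    host: str  # system the event refers to (hypothetical field)
    kind: str  # e.g. "connection_error", "print_error" (hypothetical names)


def filter_events(events):
    """Split events into (reported, suppressed).

    Rule: if a host has a connection error, every other event from
    that host is suppressed until the connection problem is isolated.
    """
    down_hosts = {e.host for e in events if e.kind == "connection_error"}
    reported, suppressed = [], []
    for e in events:
        if e.host in down_hosts and e.kind != "connection_error":
            suppressed.append(e)
        else:
            reported.append(e)
    return reported, suppressed


events = [
    Event("printsrv", "connection_error"),
    Event("printsrv", "print_error"),
    Event("filesrv", "disk_full"),
]
reported, suppressed = filter_events(events)
```

Here the print error from the unreachable print server is held back, while the connection error itself and the unrelated event from the other system are still reported.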
It's not really possible to write all the rules you might want for handling escalating events. A simple rule, such as one stating that an event of a certain type that isn't solved within 60 minutes is sent (escalated) to the next level of support, is pretty easy to implement. But for complex
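The simple time-based rule described above might look something like the following sketch. The function name `support_level`, its parameters and the numbered support levels are assumptions made for illustration, not part of any particular product.

```python
from datetime import datetime, timedelta

# Threshold taken from the 60-minute example in the text.
ESCALATION_TIMEOUT = timedelta(minutes=60)


def support_level(opened_at, now, resolved=False, current_level=1):
    """Return the support level that should own the event.

    An unresolved event that has been open for 60 minutes or more
    moves up one level; otherwise it stays where it is.
    """
    if not resolved and now - opened_at >= ESCALATION_TIMEOUT:
        return current_level + 1
    return current_level


opened = datetime(2005, 2, 1, 9, 0)
early = support_level(opened, datetime(2005, 2, 1, 9, 59))
late = support_level(opened, datetime(2005, 2, 1, 10, 0))
```

In this sketch an event opened at 9:00 stays at level 1 when checked at 9:59 and moves to level 2 when checked at 10:00, unless it has been marked resolved in the meantime.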
Because event escalation software can locate problems and their root causes, it can save your staff many hours of hard work, and while it may be expensive, it typically has a strong ROI.
Barrie Sosinsky is president of consulting company Sosinsky and Associates (Medfield, MA). He has written extensively on a variety of computer topics. His company specializes in custom software (database- and Web-related), training and technical documentation.
This was first published in February 2005