Squeezing more management from existing NMS
How can I "squeeze" more management out of my existing Network Management System (NMS), without adding more Network Management Servers? I already have 6 NMS servers and my network keeps growing. But I don't want to just continue adding more boxes.
Several factors dictate how much server capacity you need to manage a network. The first is the number of devices in the network. Generally speaking, the more devices you have connected to your network, the more CPU, memory, I/O and storage capacity you'll need to manage it. The second is how often you poll each device for its operational and performance information, and how often you log the results. The more often you poll a device and log its attributes, the more system resources you'll need to manage the network.
All commercial NMS products allow you to tune how often you poll and log devices in your network. Most of these products have a default polling interval of 60 seconds; the logging interval varies by product. I recommend throttling back on device polling as much as possible. For example, poll your critical backbone devices every minute and your second-tier devices every 5 minutes, then set everything else to 15-minute or 30-minute polling. This reduces the amount of work the NMS has to do to manage your network. In a similar fashion, set the logging rate so that 1 out of every 10 polls gets logged to the historical database. Don't log every poll unless you absolutely have to.
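To get a feel for how much work tiered polling saves, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the tier names and device counts are assumptions, not figures from any real network, and the "everything else" tier is set to 15 minutes.

```python
# Hypothetical tier names and device counts, for illustration only.
POLL_INTERVAL_SECONDS = {
    "backbone": 60,      # critical backbone devices: 1-minute polling
    "second_tier": 300,  # second-tier devices: 5-minute polling
    "other": 900,        # everything else: 15-minute polling
}
LOG_EVERY_NTH_POLL = 10  # log 1 out of every 10 polls


def polls_per_hour(device_counts, intervals):
    """Total number of polls the NMS issues per hour across all tiers."""
    return sum(count * 3600 // intervals[tier]
               for tier, count in device_counts.items())


def should_log(poll_number, every_nth=LOG_EVERY_NTH_POLL):
    """True for the 10th, 20th, 30th ... poll of a device (1-based count)."""
    return poll_number % every_nth == 0


device_counts = {"backbone": 50, "second_tier": 450, "other": 4500}  # assumed
uniform = {tier: 60 for tier in device_counts}  # every device at 60 seconds

tiered_load = polls_per_hour(device_counts, POLL_INTERVAL_SECONDS)
uniform_load = polls_per_hour(device_counts, uniform)
logged_per_hour = tiered_load // LOG_EVERY_NTH_POLL

print(uniform_load)     # 300000
print(tiered_load)      # 26400
print(logged_per_hour)  # 2640
```

With these assumed counts, tiering cuts the hourly poll count by more than a factor of ten, and logging every tenth poll cuts the database writes by another factor of ten. Your own numbers will depend on how your devices divide among the tiers.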
Many NMS platforms by default poll a device for a considerable amount of information. In many cases, the information polled goes far beyond what most customers need. The more information you poll, the more expensive polling becomes for both your NMS and your network bandwidth. You should carefully examine what information is being polled, and determine whether it is relevant to your management needs.
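One way to approach this review is to write down what is polled today next to what you actually use, and keep only the overlap. The sketch below assumes a hypothetical default poll list. The object names are standard MIB-II interface objects, but the list is not taken from any particular NMS product, and the "needed" set is just an example.

```python
# Hypothetical default poll list. The names are standard MIB-II objects,
# but no specific NMS product is claimed to ship exactly this set.
DEFAULT_POLLED = [
    "sysUpTime", "ifOperStatus", "ifInOctets", "ifOutOctets",
    "ifInErrors", "ifOutErrors", "ifInDiscards", "ifOutDiscards",
    "ifInUcastPkts", "ifOutUcastPkts", "ifInNUcastPkts", "ifOutNUcastPkts",
]

# Assumed management need: status, traffic volume and error counts only.
NEEDED = {
    "sysUpTime", "ifOperStatus",
    "ifInOctets", "ifOutOctets",
    "ifInErrors", "ifOutErrors",
}


def trim_poll_list(polled, needed):
    """Keep only the polled objects that serve a stated management need."""
    return [name for name in polled if name in needed]


trimmed = trim_poll_list(DEFAULT_POLLED, NEEDED)
saved = len(DEFAULT_POLLED) - len(trimmed)

print(len(DEFAULT_POLLED), len(trimmed), saved)  # 12 6 6
```

In this made-up case, half of the polled objects drop out, which means roughly half as many values to request, carry across the network and process on each poll.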
Finally, if you have a really large network, you might want to consider a combination of device polling and device traps. Most NMS platforms rely mostly on device polling for both incident management and performance management of the network. That is, most NMS platforms poll each device to determine its operational state (e.g. green or red status), and to log specific performance attributes for capacity planning purposes. In this situation, polling does "double duty": it serves as both the primary fault detection mechanism and the source of performance logging.
However, another mechanism exists that can reduce the amount of NMS polling and extend the management capacity of your existing NMS servers. This alternative strategy relies on a smart combination of device polling, used primarily for logging performance attributes, and device traps, used primarily for fault detection. Traps are SNMP messages that a network device, such as a router, can be configured to send to an NMS when specific events occur within the device. Most often, the events that trigger a trap are associated with important state changes within the device. For example, if one power supply in a redundant power-supply configuration fails within a router, the router can send a trap to the NMS, which can warn an operator of this important event (normally, this event would go undetected by a device poll). Relying on traps for fault management can significantly reduce the amount of work that the NMS has to do to manage a large network. Thus, traps have the potential to significantly increase the management capacity of each NMS server beyond its present configuration. The downside is that traps can be very complicated to implement. For this reason, I advise careful planning and analysis before implementing this approach.
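The idea of letting traps, rather than polls, drive the green or red status can be sketched as follows. This is a simplified model and not how any particular NMS implements it: the class names are invented for the example, and it handles trap names that have already been decoded instead of receiving real SNMP packets. linkDown and linkUp are standard SNMP generic traps, while "powerSupplyFailed" is a hypothetical stand-in for a vendor-specific trap.

```python
from dataclasses import dataclass, field

# linkDown/linkUp are standard SNMP generic traps. "powerSupplyFailed" is a
# hypothetical name standing in for a vendor-specific trap.
FAULT_TRAPS = {"linkDown", "powerSupplyFailed"}
CLEARING_TRAPS = {"linkUp": "linkDown"}  # clearing trap -> fault it clears


@dataclass
class DeviceStatus:
    open_faults: set = field(default_factory=set)

    @property
    def state(self):
        """Red while any fault is open, green otherwise."""
        return "red" if self.open_faults else "green"


class TrapDrivenMonitor:
    """Tracks device state from incoming traps instead of status polls."""

    def __init__(self):
        self.devices = {}

    def handle_trap(self, device, trap_name):
        status = self.devices.setdefault(device, DeviceStatus())
        if trap_name in FAULT_TRAPS:
            status.open_faults.add(trap_name)
        elif trap_name in CLEARING_TRAPS:
            status.open_faults.discard(CLEARING_TRAPS[trap_name])
        return status.state


monitor = TrapDrivenMonitor()
monitor.handle_trap("router1", "powerSupplyFailed")
monitor.handle_trap("router1", "linkDown")
monitor.handle_trap("router1", "linkUp")  # link recovers, power supply has not

print(monitor.devices["router1"].state)  # red
```

In this model the NMS does no work for a device until the device reports a change, which is where the capacity saving comes from. Performance polling would continue separately on whatever slower schedule you choose.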
This was first published in April 2001