Network performance management software providers are looking for ways to share the information their technologies collect across silos so that IT organizations can better collaborate.
The dreaded triage meeting is known all too well by network managers. IT professionals from multiple domains gather in a room with reports from their management tools that say everything is fine on their end. But the network manager's troubleshooting tools say there is a problem somewhere. A meeting can quickly descend into a finger-pointing free-for-all, and eventually those fingers all start to point at the networking team.
"Nobody knows where the source of an issue is," said Jim Frey, senior analyst with Enterprise Management Associates. "Everyone is looking at his own domain, whether it's systems, storage, databases. All their tools say that everything looks fine. But all the connected elements are not doing their job together."
If the management tools that storage managers, system administrators and database administrators use all show green lights, then logic dictates that it must be a network problem, even if that logic is flawed.
Network managers can prevent the fingers from being pointed at them if they can find a way of getting the right information into the hands of their cross-domain counterparts. Frey said that the network is an objective source of information that can help other domains in IT understand what's causing trouble.
"You can look at the network as the traffic cop looking down from above," he said. "It can see things like congestion, and it can check on the efficiency of flows from point A to point B to point C to point D. There's a lot of value to be gained in letting network guys, if they have performance data, take the ball when you can't find an easy answer. They can usually help you direct and focus your efforts to your most likely failure domain."
Many traditional network performance management software vendors, such as NetQoS and NetScout, are trying to make their products more relevant to non-networking professionals, Frey said, so that network managers can share reports and alerts with other IT teams.
Last month, NetQoS released version 5.0 of its Performance Manager product suite. The newest version contains some enhancements aimed at providing relevant information to people other than network engineers, particularly Application Performance Dashboard, which serves up information that is easily consumed by IT professionals across multiple domains.
"[It is] a new visualization that can provide access not only to data as far as network engineers go, but so that anybody in an organization can understand that there is an issue with performance associated with critical applications," said John Mao, product manager for NetQoS. "Most traditional vendors have some kind of application dashboard. We go a step further with this and show not only that there is an issue and who is being affected, but which component is responsible for that particular performance degradation."
Mao said NetQoS users can click from the dashboard into a troubleshooting exercise, examine how an application communicates from one tier of the IT infrastructure to the next, and drill down to the source of a problem.
"This is a concept of summarizing and abstracting information for a wider audience and providing repeatable workflow and repeatable processes for troubleshooting the actual issue," he said. "In the past, the workflow we provided for troubleshooting was geared toward engineers with a lot of line graphs and lots of different presentations and visualizations that the engineer would be accustomed to. We're trying to introduce some concepts that we can lead into some adjacent groups within IT as well. They use the same underlying data, but it's a matter of how we present that data."
Josh Hinkle, manager of data center, network and security at the American Heart Association, said he has been using NetQoS's dashboard to clear up much of the confusion that arises while troubleshooting incidents. He said network troubleshooting tools are evolving to promote collaboration among IT domains.
"Instead of sitting back there and troubleshooting when the network is down, there is a lot more emphasis on proactively identifying and communicating issues and collaborating on the fix," Hinkle said.
He described NetQoS as primarily a network team tool. "But we are providing reports and alerts to our technical partners, the application and help desk folks primarily," he said.
Frey said some network management startups have collaboration in mind from the start.
"I think ExtraHop is doing a great job of jumping right in and giving details that both the network people want to see, as well as the application support people want to see," he said. "And by providing that combined viewpoint, they really get buy-in from both those teams."
ExtraHop sells a passive network appliance that provides Layer 2 through Layer 7 visibility by listening to and recording every transaction that occurs on the network. It then analyzes the data and provides reporting to both application and network teams about how each tier in an application environment is affecting performance.
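The per-tier analysis described here can be illustrated with a minimal sketch. The transaction records and function below are invented for illustration under the assumption that a passive appliance reconstructs per-tier response times from wire data; this is not ExtraHop's actual data model or API.

```python
from collections import defaultdict

# Hypothetical transaction records as a passive appliance might reconstruct
# them from observed network traffic: (application tier, response time in ms).
transactions = [
    ("web", 12.0), ("web", 15.5), ("web", 210.0),
    ("app", 8.2), ("app", 9.1),
    ("database", 95.0), ("database", 102.3), ("database", 340.0),
]

def avg_latency_by_tier(records):
    """Average observed response time per application tier."""
    totals = defaultdict(lambda: [0.0, 0])
    for tier, ms in records:
        totals[tier][0] += ms
        totals[tier][1] += 1
    return {tier: total / count for tier, (total, count) in totals.items()}

print(avg_latency_by_tier(transactions))
```

A summary like this is what lets both teams argue from the same numbers: if the database tier's average dwarfs the others, the conversation starts there rather than with the network.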
The idea behind this approach is that application managers are starved for actual production metrics, according to ExtraHop CEO Jesse Rothstein.
"In many instances, [the application team] has access to staging servers or test servers where they can run instrumented code or performance agents or SQL profilers," Rothstein said. "But all too often in the production environment, they're flying blind."
ExtraHop customers see an immediate difference in their troubleshooting approach as a result of the appliance.
"Both the application team and the network team really love us for giving them a view into the other side of the world," said Helen Tang, ExtraHop's vice president of marketing. "One customer told us: 'After we brought in ExtraHop, it changed the entire tone of the troubleshooting meetings. It used to be just a bunch of finger-pointing.… It's not us, so it's got to be them. Now it's, OK, we have all this data in front of us that tells us what really happened. Now let's figure out who can really fix it.'"
ExtraHop has built a set of widgets that allow IT organizations to build dashboards that pivot around the role of the user, showing each user the information most relevant to what they do. For instance, Tang said, an organization can build a dashboard for database administrators in ExtraHop. Everything those users see is performance information relevant to the databases they manage, but the information is gleaned from the network.
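The role-based filtering idea can be sketched in a few lines. The role names, metric names and mapping below are invented for illustration; they are not ExtraHop's actual widget API.

```python
# Hypothetical mapping from a viewer's role to the network-derived metrics
# their dashboard should surface (names invented for illustration).
ROLE_VIEWS = {
    "dba": {"sql_latency", "query_errors", "storage_io"},
    "network": {"retransmits", "rtt", "throughput"},
}

def dashboard_for(role, metrics):
    """Return only the metrics relevant to the given role's dashboard."""
    relevant = ROLE_VIEWS.get(role, set())
    return {name: value for name, value in metrics.items() if name in relevant}

# All metrics come from the same wire data; only the presentation differs.
metrics = {"sql_latency": 95.0, "rtt": 12.5, "retransmits": 3, "query_errors": 1}
print(dashboard_for("dba", metrics))
```

The design point is the one Mao and Tang both make: the underlying data is shared, and each audience gets a view cut to its role.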
"I think enterprises are responding to this," Frey said. "It's not the majority yet. EMA did research last summer, and we found that 58% of organizations were still doing triage on an ad hoc basis, pulling together whoever they thought were the right people. But along the same lines, the network management teams were present at these triage processes 96% of the time, and 65% of the time they were calling these meetings together and were chiefly responsible for the process. So the network guys are often on the hot seat in these processes. I think network management vendors are doing the right thing by trying to develop that role."
ExtraHop recently added a storage module to its product, which tracks CIFS and iSCSI traffic and shows how storage issues can affect performance across the network.
"With the storage module, we can provide visibility into what access patterns might benefit from WAN optimization," Rothstein said. "If we can see that the same files are being accessed over and over again, we can say that's a candidate for WAN optimization. If we see access coming in from far away across the WAN, that's a candidate for optimization. If we see excessive locking in an application, it might benefit from some of the protocol optimization that some WAN optimization vendors like Riverbed provide."
"We had one customer that was running a beta version of our storage module," Rothstein said. "They saw some degraded database performance during peak hours. When we zoomed out to see what was going on, there was a large CIFS file transfer from a database server. Using the storage module, we were able to see that it was actually an offsite disaster recovery backup that was scheduled to occur during peak time. Once they knew about it, they were able to correct the configuration."
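Rothstein's three access patterns — repeated reads, access from across the WAN, and excessive locking — amount to simple heuristics. The sketch below encodes them with invented thresholds and an invented data shape; it is an illustration of the idea, not ExtraHop's detection logic.

```python
# Hypothetical heuristics mirroring the access patterns described above;
# thresholds and field names are invented for illustration.
def wan_opt_candidates(accesses, repeat_threshold=10, far_rtt_ms=50,
                       lock_threshold=100):
    """Flag file-access patterns that might benefit from WAN optimization.

    `accesses` maps a file path to observed counters: read count,
    client round-trip time in ms, and lock operations.
    """
    candidates = {}
    for path, stats in accesses.items():
        reasons = []
        if stats.get("reads", 0) >= repeat_threshold:
            reasons.append("repeatedly accessed")
        if stats.get("rtt_ms", 0) >= far_rtt_ms:
            reasons.append("accessed across the WAN")
        if stats.get("locks", 0) >= lock_threshold:
            reasons.append("excessive locking")
        if reasons:
            candidates[path] = reasons
    return candidates

example = {
    "\\\\fileserver\\reports\\q3.xlsx": {"reads": 42, "rtt_ms": 80, "locks": 3},
    "\\\\fileserver\\scratch\\tmp.dat": {"reads": 2, "rtt_ms": 5, "locks": 0},
}
print(wan_opt_candidates(example))
```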
Let us know what you think about the story; email: Shamus McGillicuddy, News Editor