Network traffic analysis is an indispensable network engineering tool that supports everything from rapid troubleshooting and security monitoring to capacity planning. It is also a beast, powered by an unrelenting fire hose of minute report packets that demand significant and clever post-processing to interpret. It doesn't exactly fit the definition of big data -- we're not talking about exabytes -- but your NetFlow monitoring system is a microcosm of big data challenges. From the perspective of your always-limited IT resources, it can easily be the big science of network engineering if you're doing it at any scale.
One thing you generally don't find in big data projects is a relational database. Although relational database management systems (RDBMSes) offer amazing functionality with features like flexible interfaces, easy-to-manage persistence and great reporting, there comes a point where they hit a wall ingesting NetFlow metrics. And to understand why, you need ask only one question: When was the last time your NetFlow data ever changed?
Zero cardinality data offers unique opportunity
Big data solutions aren't really that different from the data warehouses of old except for how they store data. There's usually a huge extract, transform, load (ETL) operation or one-off feeds from many systems into a central storage facility.
Big data platforms do periodically make CPUs choke as they plow through MapReduce-style jobs, on open solutions like Hadoop or otherwise, to make the data available. What big data and data warehousing share, though, is that both are tuned for situations where the data, ideally, doesn't change.
Once collected, NetFlow, sFlow, CFlow, JFlow or IPFIX records never change. The B-tree indexes in relational databases make it easy to find records for update, but that overhead is totally wasted on historical flow data. For too many net admins, traffic data fills up the network monitoring database and drags down the performance of everything else that depends on it. Flow monitoring is no good if it breaks hardware alerts.
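To make the write-once point concrete, here is a minimal Python sketch of an append-only flow log. The record layout, the Flow fields and the function names are all hypothetical, chosen for illustration; no vendor's actual storage format is implied. Because records are fixed-width and never updated, a row number is enough to locate any of them, so there is no update index to maintain.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple

# Hypothetical fixed-width layout (network byte order, no padding):
# start time, source IP, destination IP, source port, destination port,
# IP protocol, byte count.
RECORD = struct.Struct("!IIIHHBQ")


class Flow(NamedTuple):
    start: int
    src_ip: int
    dst_ip: int
    src_port: int
    dst_port: int
    proto: int
    octets: int


def append_flow(log: BinaryIO, flow: Flow) -> int:
    """Write one record at the end of the log and return its row number."""
    offset = log.seek(0, io.SEEK_END)
    log.write(RECORD.pack(*flow))
    return offset // RECORD.size


def read_flow(log: BinaryIO, row: int) -> Flow:
    """Fetch a record by row number: one seek, no index traversal."""
    log.seek(row * RECORD.size)
    return Flow(*RECORD.unpack(log.read(RECORD.size)))


def scan(log: BinaryIO) -> Iterator[Flow]:
    """Stream every record in arrival order."""
    log.seek(0)
    while chunk := log.read(RECORD.size):
        yield Flow(*RECORD.unpack(chunk))
```

There is deliberately no update or delete function. Aging out old data in a design like this would mean dropping whole files, not rewriting rows.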
NoSQL Flow reporting: Challenging to create
The biggest reason some NetFlow analysis vendors stick with relational databases is that implementing a NoSQL solution simply exceeds their technical abilities. There are many excellent high-performance data technologies that solve part of the problem, such as bitmap indexing (e.g., FastBit) and open solutions like Hadoop, but FastBit is not a data store and Hadoop knows nothing of networking data.
It's no easy trick to create new technology specifically geared to network engineers that delivers fast dashboards, is thrifty with CPU and storage, and is easy to manage. It takes years of monitoring experience, long-term customer collaboration and significant investment in R&D for network management vendors to go beyond simply gluing rocket parts together. However, for those that crack the code, three main advantages are realized: much higher performance, finer reporting granularity and lower operational costs.
High performance is enabled by the combination of efficient flat files with super-compressed, in-memory indexing. In some cases, it can be an order of magnitude faster, or more, on the same hardware. Less expensive detail storage allows users to adjust fact table granularity to truly support real-world help-desk resolution windows. The key is storing weeks of high-granularity, historical NetFlow data without also installing a phone-booth-size relational database server. How many times have you worked a ticket that leads to Quality of Service mapping issues, only to find out that the detailed flow data -- filled with useful spike data -- rolled out of the storage window? With a month or more of flow detail data, that issue evaporates.
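The indexing idea can be sketched in a few lines of Python. This is a toy illustration of bitmap indexing in general, not FastBit's API or any vendor's implementation: it uses plain Python integers as uncompressed bitmaps, and the class name, helper and sample flows are all hypothetical. Each distinct column value gets a bitmap in which bit i means "row i of the flat file has this value," so a multi-column filter becomes a bitwise operation in memory.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


class BitmapIndex:
    """One bitmap per distinct column value; bit i set means row i matches."""

    def __init__(self) -> None:
        self.bitmaps: Dict[Hashable, int] = defaultdict(int)

    def add(self, row: int, value: Hashable) -> None:
        self.bitmaps[value] |= 1 << row

    def rows(self, value: Hashable) -> int:
        # .get avoids creating an empty entry for values never seen.
        return self.bitmaps.get(value, 0)


def to_rows(bitmap: int) -> List[int]:
    """Expand a bitmap into the sorted list of row numbers it marks."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical sample: (IP protocol, destination port) for flat-file rows 0-4.
flows = [(6, 443), (17, 53), (6, 80), (6, 443), (17, 123)]

by_proto, by_port = BitmapIndex(), BitmapIndex()
for row, (proto, port) in enumerate(flows):
    by_proto.add(row, proto)
    by_port.add(row, port)

# "TCP and port 443" is a single bitwise AND; only matching rows get read.
https_rows = to_rows(by_proto.rows(6) & by_port.rows(443))
```

With the sample data, https_rows comes out as rows 0 and 3, and only those two records would need to be fetched from the flat file. A production index would also compress the bitmaps, which this sketch leaves out.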
The third advantage, and the one IT budget managers care about, is that dedicated NoSQL flow storage means reduced cost. Less expensive hardware, fewer RDBMS licenses and less complex maintenance can all make for an easier conversation with your manager.
Just in time for the holidays
A number of network traffic analysis and monitoring vendors offer solutions that scale out to even the largest enterprises by taking advantage of dedicated flow storage. These products bypass inherent relational database limitations and create the freedom to measure traffic as it should be measured -- from just about everywhere in your network. With the end of the year approaching, and management out of the office, it's a great time to fire up the lab and see if a little big-data tech can teach your NetFlow a thing or two.
About the author:
Patrick Hubbard is a head geek and senior technical product marketing manager at SolarWinds. With 20 years of technical expertise and IT customer perspective, his networking management experience includes work with campus, data center and storage networks; VoIP; and virtualization, with a focus on application and service delivery in both Fortune 500 companies and startups in the high tech, transportation, financial services and telecom industries. He can be reached at Patrick.Hubbard@solarwinds.com.