"The problem with enterprise caching and memcache specifically is there [isn't] a query mechanism for the cache [so they don't] incrementally go to each server and ask it what it's caching in each slot that it has," said Drew Garner, director of architecture services at Redmond, Wash.-based Concur Technologies Inc.
Concur processes two billion SQL queries and 500 million memcache transactions per day. Optimizing the performance of a database infrastructure of that size requires an application performance management system that can scale with it.
"We implemented a 60-node [memcache] cluster that stores several terabytes of memory. We needed to track down if we had specific keys that were not being dropped, or keys that were being rendered slow, or cache servers that weren't providing fast enough responses. For rendering stuff, 0.6 milliseconds is our round-trip time for a typical operation. We needed something that could handle that," Garner said.
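The kind of check Garner describes, finding keys or cache servers that respond more slowly than the typical 0.6-millisecond round trip, can be illustrated with a minimal Python sketch. It is not Extrahop's or Concur's implementation. The record layout (`server`, `key`, `rtt_ms`), the function name `flag_slow` and the 2x threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Typical round-trip time quoted in the article, in milliseconds.
BASELINE_MS = 0.6


def flag_slow(observations, field, baseline_ms=BASELINE_MS, factor=2.0):
    """Group observed cache operations by `field` ("server" or "key") and
    return the groups whose median round-trip time exceeds
    factor * baseline_ms. The record layout is hypothetical."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs[field]].append(obs["rtt_ms"])
    return sorted(
        name for name, rtts in groups.items()
        if median(rtts) > factor * baseline_ms
    )


# Made-up sample data, not measurements from Concur.
sample = [
    {"server": "cache-01", "key": "user:1", "rtt_ms": 0.5},
    {"server": "cache-01", "key": "user:2", "rtt_ms": 0.7},
    {"server": "cache-02", "key": "user:3", "rtt_ms": 2.4},
    {"server": "cache-02", "key": "user:1", "rtt_ms": 1.8},
]

print(flag_slow(sample, "server"))  # servers with a slow median
print(flag_slow(sample, "key"))     # keys with a slow median
```

The median is used rather than the mean so that a single outlier does not flag an otherwise healthy server; a production system would more likely track percentiles over time windows.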
Concur deployed Extrahop's network-based application performance management appliances on the network, plugging them into the data center's existing network tap infrastructure for passive network monitoring. Unlike most application performance management systems, Extrahop is a passive system that doesn't require agent software. And unlike other network performance management systems that track application performance, Extrahop performs more than Layer 2-4 network-level analysis. It also inspects application layer (Layer 7) metrics in real time, which allows it to directly identify events like server errors, expired SSL certificates and slow database response times.
Extrahop alerts Garner's operations team before many of Concur's other monitoring products pick up on anything, he said.
"So when a particular [customer] is having an issue where cached items aren't expiring correctly, and we think it's due to a software bug, [Extrahop] is able to parse through about 600 million operations that we do on that tier to see where the problem [is]," Garner said.
Catching a logging bug avoids network upgrade
Concur processes billions of sensitive transactions per day for its customers, and the company's servers log many of those transactions. This logging produces additional network traffic, which is typically fine, except when a software bug inflates it.
"We had a bug in the logging [for the Concur system]. It was essentially triple logging a lot of different entries in the system, and logging kind of ballooned out of nowhere," Garner said.
The network engineering team initially mistook the growth in bandwidth demand set off by the logging bug for organic application transaction traffic and prepared to upgrade portions of the server access layer of the network to 10 Gigabit Ethernet (GbE). But then Extrahop identified the logging bug in Layer 7, allowing the company to delay the upgrade.
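To make the distinction between organic growth and a duplication bug concrete, here is a rough Python sketch that counts repeated log entries. It is an illustration only, not how Extrahop found the bug. It assumes duplicated entries are byte-for-byte identical, which real log lines with timestamps often are not, and the function names and sample lines are hypothetical.

```python
from collections import Counter


def duplicated_entries(log_lines, min_copies=2):
    """Return {entry: count} for entries seen at least min_copies times.
    Assumes duplicates are exact matches after stripping whitespace."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return {entry: n for entry, n in counts.items() if n >= min_copies}


def duplication_ratio(log_lines):
    """Total entries divided by distinct entries. A value near 1.0 suggests
    growth is organic; a value near 3.0 would match triple logging."""
    lines = [line.strip() for line in log_lines if line.strip()]
    if not lines:
        return 1.0
    return len(lines) / len(set(lines))


# Made-up log lines for illustration.
sample_log = (
    ["txn=1001 status=ok"] * 3
    + ["txn=1002 status=ok"] * 3
    + ["txn=1003 status=ok"]
)

print(duplicated_entries(sample_log))
print(round(duplication_ratio(sample_log), 2))
```

A ratio well above 1.0 points to the application, not to customer demand, as the source of the extra bandwidth.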
"We'll still eventually need 10 Gigabit interfaces," Garner said. "But as a former senior network engineer and network architect, I would tell you that when we didn't have any insight into why an application was requesting bandwidth, we were just like an electricity service provider. I don't care that your fridge is broken and is requesting double the wattage that it needs. I just provide double the wattage. I think that's the most important thing about Extrahop. It gives us context about who is using the network and why they are using it. Then we can dig down and say, 'These are database calls; these are cache calls.'"
Eyeballing packet captures a thing of the past
In Concur's large-scale, multi-tiered application environment, network engineers once spent much of their time analyzing packet captures from multiple machines, searching for root causes of network performance problems. The Extrahop application performance management system has changed that.
"It's helped give them their sanity back and get away from single-point-in-time packet captures on multiple machines that had to be correlated," Garner said. "We didn't have a system that did what Extrahop does on the back end today. It was a lot of piecemeal processes: 'We think there are errors there, let's do a packet capture.' It was just really cumbersome. Even though you had an analysis engine [like Wireshark], it didn't have a link to real-time data. It was a process of pick the data you want to analyze, collect gigs and gigs of it, and feed it into this other system until it spits out what it thinks the issues are."
Now Extrahop's technology has penetrated other silos of the IT organization, Garner said.
"It's used all the way through to our software development team," he said. "So when we do one of our monthly updates and there are profile changes in terms of SQL queries or cache calls, they can make sure it had the impact they wanted. It definitely did start with networking, but now it's networking, storage, servers. It's across everything."
Let us know what you think about the story; email: Shamus McGillicuddy, News Director.