When storage performance is the problem
When companies are pressed for increased storage performance, they tend to throw more disks at the problem, but that may not be the answer. By spreading reads across more spindles, IT can increase the rate (bits per second) at which data can be read back. Although this can deliver the performance needed, it results in storage sprawl -- very low utilization of each disk, because small amounts of data are spread across a large number of spindles.
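The trade-off between throughput and utilization can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the function names and every figure in it (per-disk throughput, disk capacity, data volume) are hypothetical values chosen for illustration, and it assumes read throughput scales linearly with the number of spindles.

```python
import math


def disks_for_throughput(target_mb_s: float, per_disk_mb_s: float) -> int:
    """Spindles needed to reach a target read rate, assuming linear scaling."""
    return math.ceil(target_mb_s / per_disk_mb_s)


def utilization(data_gb: float, disk_count: int, disk_capacity_gb: float) -> float:
    """Fraction of raw capacity actually holding data."""
    return data_gb / (disk_count * disk_capacity_gb)


# Hypothetical workload: 2 TB of data that must be read at 2,000 MB/s,
# on disks that each deliver 100 MB/s and hold 1,000 GB.
disks = disks_for_throughput(target_mb_s=2000.0, per_disk_mb_s=100.0)
used = utilization(data_gb=2000.0, disk_count=disks, disk_capacity_gb=1000.0)
print(f"{disks} disks, {used:.0%} of capacity in use")
```

Under these assumed numbers, capacity alone would call for only two disks; the rest are there purely for speed, which is the sprawl described above.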
Instead of adding disks simply to improve file read times, networking professionals can look at alternatives such as caching. This might be accomplished by increasing the amount of cache on existing storage or by adding another layer of caching, as provided by Schooner Technologies or Gear6.
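To illustrate the idea of a caching layer, here is a minimal sketch of a read cache that sits in front of slower storage and evicts the least recently used entry when it is full. It is not how any vendor's product works; the class name ReadCache, the backend_read callable, and the eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class ReadCache:
    """Least-recently-used read cache in front of a slower backend (illustrative)."""

    def __init__(self, backend_read: Callable[[str], bytes], capacity: int):
        self._read = backend_read      # hypothetical function that hits the disks
        self._capacity = capacity      # maximum number of cached entries
        self._blocks = OrderedDict()   # key -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._blocks:
            # Served from memory; mark the entry as most recently used.
            self._blocks.move_to_end(key)
            self.hits += 1
            return self._blocks[key]
        # Cache miss: go to the backend, then remember the result.
        self.misses += 1
        data = self._read(key)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used entry
        return data
```

Every hit is a read the disks never see, so a workload that rereads the same data can be sped up without adding a single spindle.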
Data proliferation can equal storage sprawl
IT systems are still permeating business processes in most organizations. While IT systems are present in all parts of the organization, they are not yet in every component of all of the business processes. And, of course, new processes crop up all the time, as new business activities arise.
As an organization launches new computerized processes, it also generates new data streams. This is especially true where paper forms become digital or -- worse, from a storage perspective -- where paper forms remain and are digitized. Digitizing paper forms creates both a stream of active data (what was filled out on the form) and a blob of permanently unchanging data (the digital copy, preserved for archival reasons). Hospitals shifting to electronic healthcare records are a prime example of this kind of changeover.
What's more, processes are not static. Organizations often bring more data into a process over time than was associated with it at the start. Digitized images of X-rays get tacked onto the radiologist's assessment of the X-rays, for example, as that assessment is passed along to an insurer.
Organizations retain more data all the time, whether in response to changing operational needs or compliance requirements, or as a legal insurance policy. They retain data longer, too. Increasingly, organizations retain everything forever to avoid the risk of deleting something that could be relevant in a lawsuit.
Improving storage performance
Many data-center managers we work with are of the opinion that as long as there is room for more disks or disk arrays, they should simply throw disk at their storage problems. This is the wrong approach. Disk is cheap, to be sure, but storage management is not. The overhead associated with a continually expanding storage environment is significant. And, once you run out of space, you finally have to face the problem head-on -- so it is better to do so before you lose the option to add more disks in a pinch.
To get more work out of your storage, start by recognizing that you are probably storing lots of stuff you don't want or shouldn't keep. Storing the same documents over and over again is a primary cause of superfluous storage in most organizations. There are two fundamental approaches to avoiding this: deduplication and content management.
Storage management strategies: Deduplication and content management
Deduplication systems process storage traffic as it heads to disk (or tape) in order to see whether it has already been saved. When the data is already stored, the deduplication system can replace the new copy with a pointer to the old copy. Storage volumes can be reduced by 30% to 90%, depending on the types of files in use and the work environment. IBM, Data Domain (which EMC and NetApp are fighting to acquire), Kazeon, and FalconStor are dedupe vendors specializing in different aspects of the problem (backup vs. disk, for example).
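The pointer-replacement idea can be sketched in a few lines. This is a simplified illustration that assumes whole-object deduplication keyed on a SHA-256 content hash; commercial products may work on sub-file chunks and use other techniques. The class DedupStore and its methods are hypothetical names, and the sketch omits cleanup of data that no name points to any longer.

```python
import hashlib


class DedupStore:
    """Stores each distinct payload once; names are pointers to payloads."""

    def __init__(self):
        self._payloads = {}  # content hash -> bytes (stored once)
        self._pointers = {}  # object name -> content hash

    def write(self, name: str, data: bytes) -> bool:
        """Save data under name. Returns True only if new bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self._payloads
        if is_new:
            self._payloads[digest] = data
        # Either way, the name simply points at the single stored copy.
        self._pointers[name] = digest
        return is_new

    def read(self, name: str) -> bytes:
        return self._payloads[self._pointers[name]]

    def logical_bytes(self) -> int:
        """Bytes that would be stored without deduplication."""
        return sum(len(self._payloads[d]) for d in self._pointers.values())

    def physical_bytes(self) -> int:
        """Bytes actually stored."""
        return sum(len(p) for p in self._payloads.values())

    def savings(self) -> float:
        """Fraction of logical bytes that deduplication avoided storing."""
        logical = self.logical_bytes()
        return 0.0 if logical == 0 else 1 - self.physical_bytes() / logical
```

Saving the same attachment under three different names, for example, consumes the space of one copy, and savings() reports the resulting reduction as a fraction of what would otherwise have been written.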
Content management addresses the same issue from the user side by trying to prevent wasteful storage before bits head to disk. With a content-management system, an organization can build a library of documents and put limits and process around the creation of new versions of those documents. Big content management vendors, such as EMC (Documentum) and IBM (FileNet), dominate the space. Smaller vendors like Xythos, freeware such as Alfresco, and pay-as-you-go SaaS providers such as SpringCM provide lower-cost alternatives for smaller shops and tighter budgets. Content-management systems also help organizations tag documents for retention for a set duration and decide whether and when it is time to archive or delete them.
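A retention tag can be as simple as a duration attached to each document. The sketch below is an assumed, simplified policy rather than the behavior of any product named above: the Document fields, the one-year archive threshold, and the disposition function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    name: str
    created: date
    retain_days: int  # retention tag: how long the document must be kept


def disposition(doc: Document, today: date, archive_after_days: int = 365) -> str:
    """Return 'keep', 'archive', or 'delete' under the assumed policy."""
    age_days = (today - doc.created).days
    if age_days >= doc.retain_days:
        return "delete"   # retention period has expired
    if age_days >= archive_after_days:
        return "archive"  # still retained, but moved off primary storage
    return "keep"
```

An organization that retains everything forever would, in effect, set retain_days so high that nothing ever reaches the delete branch, which is one way storage demand keeps growing.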
Whether addressed behind the scenes by deduplication or as a part of improved processes using content management, controlling the growth of demand is key to controlling storage sprawl and improving storage performance.
About the author
John Burke is a principal research analyst with Nemertes Research, where he focuses on software-oriented architectures and management.
This was first published in July 2009