Wednesday, July 8, 2009

The drawbacks to data reduction

Data reduction, or capacity optimization, has succeeded in the backup/archive space (i.e., secondary storage), but applying data reduction techniques such as deduplication and/or compression to primary storage is a horse of a different color. This is why the leading vendors in data deduplication for secondary storage (e.g., Data Domain, EMC, IBM, FalconStor, etc.) are not the same players as we find in the market for data reduction on primary storage.

A lot of articles have been written about primary storage optimization (as the Taneja Group consulting firm refers to it), but most of them focus on the advantages while ignoring the ‘gotchas’ associated with the technology. InfoStor (me, in particular) has been guilty of this (see “Consider data reduction for primary storage”).

In that article, I focused on the advantages of data reduction for primary storage, and introduced the key players (NetApp, EMC, Ocarina, Storwize, Hifn/Exar, and greenBytes) and their different approaches to capacity optimization. But I didn’t get into the drawbacks.

In a recent blog post, Wikibon president and founder Dave Vellante drills into the drawbacks associated with data reduction on primary storage (which Wikibon refers to broadly as “online or primary data compression”).

Vellante divides the market into three approaches:

--“Data deduplication light” approaches such as those used by NetApp and EMC
--Host-managed data reduction (e.g., Ocarina Networks)
--In-line data compression (e.g., Storwize)

All of these approaches have the same benefits (reduced capacity and costs), but each has a few drawbacks. Recommended reading: “Pitfalls of compressing online storage.”
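Whatever the delivery model, the capacity savings in all three approaches rest on the same core idea: detect redundant data and store it only once, keeping references to the single stored copy. The following is a minimal, illustrative sketch of fixed-block deduplication (block size, function names, and the toy data are assumptions for illustration, not any vendor's implementation):

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Illustrative fixed-block deduplication: store each unique block
    once, keyed by its hash, plus an ordered list of references."""
    store = {}  # digest -> unique block contents
    refs = []   # ordered digests that reconstruct the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is kept
        refs.append(digest)
    return store, refs

def rehydrate(store, refs):
    """Reassemble the original data from the block store."""
    return b"".join(store[d] for d in refs)

# Highly redundant data dedupes well; random data would not.
data = b"A" * 4096 * 10 + b"B" * 4096 * 5
store, refs = dedupe(data)
assert rehydrate(store, refs) == data
print(len(data), sum(len(b) for b in store.values()))  # 61440 vs 8192
```

The trade-offs the vendors differ on are *when* and *where* this work happens: in the data path (in-line), after the fact (post-process), or on the host.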

1 comment:

Sunshine said...

Interesting post, Dave. Here are a few thoughts, as expressed by a few of the folks at Ocarina.

1. Although there’s lots of talk about filer performance, very few applications push that envelope. Where they do, in-band solutions are the way to go. But for most storage admins, efficiency is a higher priority than maximizing performance.
2. There’s no disputing the data aging curve; after a couple of days of inactivity, I/O requirements on a given file drop dramatically. As long as the post-process reduction solution includes basic filters and policies, this can be precisely tuned to minimize or eliminate performance impact.
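The age-based policy described in point 2 can be sketched in a few lines. This is a hypothetical filter, not Ocarina's actual policy engine: the two-day threshold and the use of last-access time are assumptions chosen to match the aging-curve argument above.

```python
import os
import time

# Illustrative policy: only files idle this long are candidates for
# post-process reduction (the threshold is an assumption).
IDLE_DAYS = 2

def is_candidate(path, now=None):
    """True if the file has been inactive long enough that optimizing
    it should have little or no performance impact."""
    now = now if now is not None else time.time()
    idle_seconds = now - os.stat(path).st_atime
    return idle_seconds >= IDLE_DAYS * 86400

def candidates(root):
    """Walk a directory tree and yield files the policy would optimize."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_candidate(path):
                yield path
```

A real policy engine would layer more filters on top (file type, size, ownership), but the principle is the same: hot data is left alone, cold data is optimized.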

Although we like the benefits of the post-processing architecture because of its more powerful impact on large data stores, the important trend to watch will be end-to-end optimization, where the movement of data between tiers, across the WAN, and to backup solutions doesn’t require wasteful rehydration and re-assembly of files.

For more on this, Carter George has written a follow-up post.

Sunshine Mugrabi