
Deduplication is hot, and it is no secret why. Organizations are being swamped by a flood of data that must be backed up and protected. And with only 24 hours in a day, many organizations find they cannot finish backing up today’s data before tomorrow’s data arrives.
Protecting data through backup is not only a long-standing IT best practice; it has become part of how organizations conduct business. Traditionally, organizations have protected data through tape backup. Tape, however, has proven to be a problematic solution at best. It can be slow, tape handling is labor-intensive, and the results are inconsistent. And backing up growing volumes of data to tape during an overnight backup window is becoming increasingly difficult for many organizations.
Ideally, organizations would like to back up their data to disk. With disk backup, they would get faster, more reliable backups and recovery, and they would have little difficulty backing up high volumes of data during an overnight backup window. However, disk, even the lowest-cost, slow-spinning disk, is expensive compared to tape.
Moore’s Law is popularly summarized as computing power doubling every 18 to 24 months. Although it originally described the doubling of the number of transistors packed onto a chip, a similar trend has generally held for storage as well. In the storage world, disk drive vendors have steadily increased the density of data packed onto a spinning disk, effectively lowering the per-gigabyte cost of disk storage.
Today, thanks to the power of Moore’s Law, the cost of disk storage has fallen to the point where it can be used effectively to back up some data. Although still more expensive than tape, current disk costs enable companies to assemble sizeable pools of disk storage for use in backing up their data and keeping it online for a period of time.
The issue might have stopped right there except for two things. One, the skyrocketing volumes of data organizations face are outpacing even the rate at which disk costs are falling. Two, storage managers began realizing that they were backing up multiple copies of the exact same data.
For example, say 500 people receive a company-wide e-mail with a 1 MB attachment. If each recipient saves that attachment, it is stored 500 times. During the nightly backup, a system would back up that one attachment 500 times, consuming 499 MB more backup space than necessary. Then, let’s say, the next day one person makes one small change to that file and sends the slightly changed file to all the initial recipients, who save it again. Now the organization will be backing up 499 more copies of a file that is exactly the same as the previous 500 copies except for that one change. No matter how cheap storage capacity becomes, redundant data like this is simply wasteful.
Data deduplication is a method for eliminating redundant data from storage, especially from backups. It works by saving a single copy of redundant data, replacing any further instances with pointers to the saved copy. So, instead of saving and backing up 500 copies of the attachment, you save just one copy and provide a pointer to it.
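The e-mail example can be sketched in a few lines of Python. This is only an illustration of the single-copy-plus-pointers idea, not any vendor's implementation: the class name SingleInstanceStore, the use of SHA-256 to recognize identical data, and the "userN/attachment" naming are all assumptions made for the sketch.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical store: one copy per distinct payload, pointers for the rest."""

    def __init__(self):
        self.blobs = {}     # digest -> payload bytes (each stored once)
        self.pointers = {}  # logical name -> digest of the saved copy

    def save(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)  # keep only the first copy
        self.pointers[name] = digest            # later copies are just pointers

    def load(self, name):
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * 1_000_000  # stand-in for the 1 MB attachment
store = SingleInstanceStore()
for user in range(500):
    store.save(f"user{user}/attachment", attachment)

logical_bytes = 500 * len(attachment)
print(logical_bytes, store.stored_bytes())
```

All 500 recipients can still read their attachment through load(), but the payload occupies the space of a single copy. The pointer table itself takes some space, which the sketch ignores.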
Just extrapolate the example above beyond e-mail to the thousands of gigabytes of data stored and backed up every month or year. Inevitably, there is redundant data. Deduplication helps by removing the redundant data, which allows the organization to free up more storage capacity, especially for backup. By doing so, the organization can maximize the value of its disk-based backup capacity to retain more backups for a longer time on a given amount of disk space. Data deduplication can also help:
The problems caused by storing and backing up redundant data, and the benefits of deduplication, are so clear that analysts almost unanimously cite data deduplication as a critical technology going forward. Not surprisingly, vendors are reporting surging sales of deduplication tools. One analyst firm predicted that the deduplication market would grow from zero just a few years ago to $1 billion in the next year or two.
The concept behind deduplication is straightforward. When data is saved or, more often, when it is backed up, the deduplication tool scans the data looking for redundant segments. The first time it sees a segment, it stores the segment and records it in an index. When it encounters a possible duplicate, it checks the index and, if the segment is already stored, leaves a pointer to the saved copy instead of storing the segment again. In this way, each piece of data is stored and backed up just once. For applications and users of stored data, the process is transparent.
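A minimal sketch of this index-and-pointer process follows, assuming fixed-size 4 KB segments and SHA-256 digests as index keys. The function names backup and restore, the segment size, and the sample data are illustrative assumptions; real products differ in how they cut and compare segments.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed segment size for this sketch


def backup(data, index, chunk_store):
    """Store only segments not yet in the index; return a list of pointers."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:               # first time this segment is seen
            index[digest] = len(chunk_store)  # note its location in the index
            chunk_store.append(chunk)
        recipe.append(digest)                 # pointer to the saved segment
    return recipe


def restore(recipe, index, chunk_store):
    """Rebuild the original data by following the pointers."""
    return b"".join(chunk_store[index[digest]] for digest in recipe)


index, chunk_store = {}, []

# Four distinct 4 KB segments stand in for a file.
original = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
edited = bytearray(original)
edited[5000] = 255  # one small change, inside the second segment
edited = bytes(edited)

recipe_original = backup(original, index, chunk_store)
recipe_edited = backup(edited, index, chunk_store)
print(len(chunk_store))  # segments physically stored for both versions
```

Backing up the edited file adds only the one segment that changed; the other three are recorded as pointers. Fixed-size segments are the simplest choice but only line up when a change does not shift the data that follows it, so treat this as a sketch of the concept rather than a recommendation.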
Although the concept is simple enough, vendors have implemented a number of ways to deduplicate data. The various approaches have different advantages and disadvantages. For that reason, organizations need to pay attention to which deduplication approach they adopt. When it comes to data deduplication, one size does not fit all.
The best approach to data deduplication depends on the size of the organization and its backup needs.
There are many vendors in the market. Most vendors offer only one method, either object-level differencing or hash-based chunking. However, the two technologies bring different strengths and weaknesses, making them more advantageous in some environments than others.
Managers will encounter some key deduplication terms as they sort through the various deduplication options:
Deduplication presents a number of tradeoffs. Which tradeoff an organization makes depends on the organization, its data, and its situation.
For example, it might seem that squeezing more data into less space would mean there's more room to cram in new data, but that's not how data deduplication works. Adding more unique data doesn't take advantage of the space savings provided by pointers to redundant data. Instead, deduplication makes it possible to store more backups of the same data set for a longer time in the same amount of space. This translates into a faster recovery, especially when an older version of data is needed, because it is more likely to still be online. But it doesn't necessarily translate into freeing up room for more unique data.
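To make this tradeoff concrete, the sketch below compares seven nightly backups of one slowly changing data set with seven unrelated data sets of the same size. Everything here is hypothetical: the helper names make_dataset and unique_bytes, the 1 KB segment size, and the one-segment-per-night change rate are assumptions chosen to keep the arithmetic easy to follow.

```python
import hashlib

CHUNK = 1024  # assumed segment size


def make_dataset(seed, n_chunks=100):
    """Build a deterministic data set of n_chunks distinct 1 KB segments."""
    return b"".join(
        hashlib.sha256(f"{seed}-{i}".encode()).digest() * (CHUNK // 32)
        for i in range(n_chunks)
    )


def unique_bytes(backups):
    """Bytes that must actually be stored after segment-level deduplication."""
    seen, stored = set(), 0
    for data in backups:
        for offset in range(0, len(data), CHUNK):
            chunk = data[offset:offset + CHUNK]
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return stored


# Seven nightly backups of the same data set; one segment changes each night.
nightly = [make_dataset("base")]
for night in range(1, 7):
    data = bytearray(nightly[-1])
    new_chunk = hashlib.sha256(f"edit-{night}".encode()).digest() * (CHUNK // 32)
    data[night * CHUNK:(night + 1) * CHUNK] = new_chunk
    nightly.append(bytes(data))

# Seven unrelated data sets of the same size: all unique data.
unrelated = [make_dataset(f"set{n}") for n in range(7)]

logical = sum(len(b) for b in nightly)
print(logical, unique_bytes(nightly), unique_bytes(unrelated))
```

Both cases present the same amount of data to the backup system, but only the repeated backups shrink. The unrelated data sets occupy their full size after deduplication.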
Inline versus post-processing and source versus target deduplication present similar tradeoffs. Source-based and inline deduplication work well if the organization has the time and the source-side resources to do the work and scalability isn’t an issue. Post-processing and target-based deduplication usually are preferred when speed and scalability are issues.
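As a rough illustration of the inline versus post-processing tradeoff, the sketch below assumes that inline deduplication removes duplicates before data is written, while post-processing first lands the raw backup in a staging area and deduplicates it in a later pass. The function names and the staging model are assumptions for the sketch, not a description of any particular product.

```python
import hashlib


def dedupe(chunks, store):
    """Add chunks to store (digest -> chunk), keeping one copy of each."""
    for chunk in chunks:
        store.setdefault(hashlib.sha256(chunk).digest(), chunk)


def inline_backup(chunks, store):
    """Deduplicate each chunk before it is written; no staging space needed."""
    dedupe(chunks, store)  # hashing work happens during the backup itself
    return 0               # bytes of staging space used


def post_process_backup(chunks, store):
    """Land the raw backup first, then deduplicate it in a later pass."""
    staging = list(chunks)  # raw write at full size, no hashing yet
    peak_staging = sum(len(chunk) for chunk in staging)
    dedupe(staging, store)  # later pass, outside the backup window
    staging.clear()
    return peak_staging


chunks = [b"A" * 1024, b"B" * 1024, b"A" * 1024, b"A" * 1024]
inline_store, post_store = {}, {}
inline_staging = inline_backup(chunks, inline_store)
post_staging = post_process_backup(chunks, post_store)
print(inline_staging, post_staging, len(inline_store), len(post_store))
```

Both paths end with the same deduplicated store. The difference is where the cost falls: inline spends processing time while the backup is running, and post-processing spends temporary disk space to keep the backup itself fast.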
Deduplication ratios present another area of confusion. Deduplication ratios (the ratio describing how much data goes into the deduplication process compared to how much ultimately is stored) can be calculated a number of ways with widely varying results. Approach all deduplication ratio claims with skepticism until you have performed your own calculations using your own actual data.
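The arithmetic behind a ratio claim is simple enough to check yourself. The sketch below uses the definition given above (data in divided by data stored) and shows how the same system can produce very different ratios depending on what is measured. The function names and the 30-night figures are hypothetical values chosen for illustration, not measurements.

```python
def dedup_ratio(bytes_in, bytes_stored):
    """Data entering deduplication divided by data ultimately stored."""
    if bytes_stored <= 0:
        raise ValueError("bytes_stored must be positive")
    return bytes_in / bytes_stored


def space_saved_percent(bytes_in, bytes_stored):
    """The same measurement expressed as a percentage of space saved."""
    return 100.0 * (1 - bytes_stored / bytes_in)


MB = 1_000_000

# The e-mail example: 500 copies of a 1 MB attachment, one copy stored.
email_ratio = dedup_ratio(500 * MB, 1 * MB)
email_saved = space_saved_percent(500 * MB, 1 * MB)

# Hypothetical system: a 1,000 MB full backup that stores 500 MB on the
# first night, then adds 10 MB of new data on each of 29 later nights.
first_night = dedup_ratio(1000 * MB, 500 * MB)
thirty_nights = dedup_ratio(30 * 1000 * MB, 500 * MB + 29 * 10 * MB)

print(email_ratio, round(email_saved, 1))
print(first_night, round(thirty_nights, 1))

# Ratios and savings are not linear: 10:1 versus 20:1 is 90% versus 95%.
print(space_saved_percent(10, 1), space_saved_percent(20, 1))
```

In this made-up case a vendor could truthfully quote either 2:1 or roughly 38:1 for the same system, depending on whether the ratio covers one backup or a month of retained backups. That is one reason to recompute any claimed ratio against your own data and retention period.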
When selecting a deduplication product, it pays to understand the following:
There are a number of valid approaches to deduplication, and it is important to keep in mind that one size does not fit all. Managers must analyze their own information systems environment and their specific data protection needs. Only then can they find deduplication vendors and products that fit their specific situation and needs.
You can learn more by visiting www.hp.com/go/deduplication.
Alan Radding is a widely published business and technology writer and the research director at Independent Assessment, www.independentassessment.com, 617-332-4369.