"The online business magazine at the heart of international business management news..."

05 Jul 2010

Navigating the Deduplication Landscape

By Alan Radding of HP Storage Works


Deduplication is hot, and it is no secret why. Organizations are being swamped by a flood of data that must be backed up and protected. And with only 24 hours in a day, many organizations are finding they cannot finish backing up today’s data before tomorrow’s data arrives.

Protecting data through backup is not only a long-standing IT best practice; it has become part of the way organizations conduct business. Traditionally, organizations have protected data through tape backup. Tape, however, has proven to be a problematic solution at best. It can be slow, tape handling is labor-intensive, and the results are inconsistent. And backing up growing volumes of data to tape during an overnight backup window is becoming increasingly difficult for many organizations.

Ideally, organizations would like to back up their data to disk. With disk backup, they would experience faster, more reliable backups and recovery, and they would have no difficulty backing up high volumes of data during an overnight backup window. However, disk, even the lowest-cost, slow-spinning disk, is expensive compared to tape.

Moore’s Law

Moore’s Law is commonly summarized as computing power doubling every 18 to 24 months. Although it originally described the doubling of the number of transistors packed onto a chip, the same pattern has generally held for storage as well. In the storage world, disk drive vendors have steadily increased the density of data packed onto a spinning disk, effectively lowering the per-gigabyte cost of disk storage.

Today, thanks to the power of Moore’s Law, the cost of disk storage has fallen to the point where it can be used effectively to back up some data. Although still more expensive than tape, current disk costs enable companies to assemble sizeable pools of disk storage for use in backing up their data and keeping it online for a period of time.

The issue might have stopped right there except for two things. One, the skyrocketing volumes of data organizations face are outpacing even the rate at which disk costs are falling. Two, storage managers began realizing that they were backing up multiple copies of the exact same data.

For example, say 500 people receive a company-wide e-mail with a 1 MB attachment. If each recipient saves that attachment, it is stored 500 times. During the nightly backup, a system would back up that one attachment 500 times – consuming 499 MB more backup space than necessary. Then, let’s say, the next day one person makes one small change to that file and sends the slightly changed file around to all the initial recipients, who save it again. Now the organization will be backing up 499 more copies of a file that is exactly the same as the previous 500 copies with the exception of that one change. No matter how cheap storage capacity becomes, redundant data like this is simply wasteful.

Enter Deduplication

Data deduplication is a method for eliminating redundant data from storage, especially from backups. It works by saving a single copy of redundant data and replacing any further instances with pointers to the saved copy. So, instead of saving and backing up 500 copies of the attachment, you save just one copy and replace the other 499 with pointers to it.
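The arithmetic of the e-mail example can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the function names are made up for this sketch, the copy counts are taken from the example above, and the space consumed by pointers and the index is assumed to be negligible.

```python
# Illustrative arithmetic only: sizes and copy counts come from the e-mail
# example in the text; pointer and index overhead is assumed to be negligible.
ATTACHMENT_MB = 1
COPIES_PER_VERSION = [500, 499]  # day 1 original, day 2 slightly changed file


def backup_mb_without_dedup(copies_per_version, size_mb):
    """Every saved copy is backed up in full."""
    return sum(copies_per_version) * size_mb


def backup_mb_with_file_dedup(copies_per_version, size_mb):
    """Each distinct version is stored once; other copies become pointers."""
    distinct_versions = sum(1 for copies in copies_per_version if copies > 0)
    return distinct_versions * size_mb


without_dedup = backup_mb_without_dedup(COPIES_PER_VERSION, ATTACHMENT_MB)
with_dedup = backup_mb_with_file_dedup(COPIES_PER_VERSION, ATTACHMENT_MB)
print(f"without deduplication: {without_dedup} MB")
print(f"with deduplication:    {with_dedup} MB")
print(f"space saved:           {without_dedup - with_dedup} MB")
```

Under these assumptions, two days of saved attachments shrink from 999 MB to 2 MB, because only the two distinct versions of the file are actually stored.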

Just extrapolate the example above beyond e-mail to the thousands of gigabytes of data stored and backed up every month or year. Inevitably, there is redundant data. Deduplication helps by removing the redundant data, which allows the organization to free up more storage capacity, especially for backup. By doing so, the organization can maximize the value of its disk-based backup capacity to retain more backups for a longer time on a given amount of disk space. Data deduplication can also help:

  • Save money through lower disk investments
  • Free up network bandwidth
  • Reduce reliance on cumbersome tape backup
  • Recover backed up data faster after an outage

The problems caused by storing and backing up redundant data and the benefits from deduplication are so clear that analysts almost unanimously cite data deduplication as a critical technology going forward. Not surprisingly, vendors are reporting surging sales of deduplication tools. One analyst firm predicted that the deduplication market would grow from zero just a few years ago to $1 billion in the next year or two.

How Deduplication Works

The concept behind deduplication is straightforward. When data is saved or, more often, when it is backed up, the deduplication tool scans the data looking for redundant segments. When it finds a redundant segment, it leaves a pointer to the stored copy and notes it in an index. As it encounters further copies, it checks the index and uses the appropriate pointer. In this way, each piece of data is stored and backed up just once. For applications and users of stored data, the process is transparent.
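The index-and-pointer idea can be illustrated with a toy store in Python. This is a minimal sketch, not how any particular product works: the class name `DedupStore` and its methods are hypothetical, and using a SHA-256 fingerprint as the index key is an assumption made for the example.

```python
import hashlib


class DedupStore:
    """Toy single-instance store: one copy per unique segment, plus pointers."""

    def __init__(self):
        self.index = {}     # fingerprint -> the single stored copy of a segment
        self.pointers = []  # one fingerprint per segment written, in order

    def write(self, segment: bytes) -> bool:
        """Record a segment; return True if it was new, False if a duplicate."""
        fingerprint = hashlib.sha256(segment).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = segment  # store the data only once
        self.pointers.append(fingerprint)      # always leave a pointer
        return is_new

    def read_all(self):
        """Rebuild the original sequence of segments by following pointers."""
        return [self.index[fp] for fp in self.pointers]


store = DedupStore()
attachment = b"quarterly report " * 1000
for _ in range(500):
    store.write(attachment)

print(len(store.pointers), "segments written,", len(store.index), "stored")
```

Reading the data back follows the pointers to the single stored copy, which is why the process can stay transparent to applications and users.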

Although the concept is simple enough, vendors have implemented a number of ways to deduplicate data. The various approaches have different advantages and disadvantages. For that reason, organizations need to pay attention to which deduplication approach they adopt. When it comes to data deduplication, one size does not fit all.

There are three key issues when selecting a deduplication approach:

  1. Location: Refers to deduplication occurring at the source (a server, for example) or at the target that stores the data (such as a virtual tape library). A source-based approach results in less data being sent across the network for backup, potentially shortening backup windows. However, it requires more processing to be done at the source. A target-based approach is well-suited for a virtual tape device and, therefore, able to replace tape backup and speed up data retrieval processes.
  2. Timing: Applies to target-based implementations, where data can either be backed up first and then deduplicated (post-process), or deduplicated during the backup process (inline). Each method has pros and cons: post-process deduplication may result in a faster backup, but the results of the inline process can be replicated (say, offsite for remote data protection) immediately after a backup concludes.
  3. Method: Refers to the deduplication techniques employed. Object-level differencing, for example, reduces data by storing only the changes that occur, while a hash-based approach locates global redundancies that occur among all files in a backup.
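The location tradeoff in item 1 can be sketched by counting the bytes that cross the network in each case. The function names below are hypothetical, the SHA-256 fingerprint is an assumption, and the sketch ignores the small amount of traffic needed to exchange fingerprints in a real source-based system.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Identify a segment by its SHA-256 digest (an assumption of this sketch)."""
    return hashlib.sha256(segment).hexdigest()


def source_side_backup(segments, target_index):
    """Deduplicate at the source: send only segments the target lacks."""
    bytes_sent = 0
    for seg in segments:
        fp = fingerprint(seg)          # extra processing done on the source
        if fp not in target_index:
            target_index[fp] = seg
            bytes_sent += len(seg)
    return bytes_sent


def target_side_backup(segments, target_index):
    """Deduplicate at the target: send everything, discard repeats on arrival."""
    bytes_sent = 0
    for seg in segments:
        bytes_sent += len(seg)         # every segment crosses the network
        target_index.setdefault(fingerprint(seg), seg)
    return bytes_sent


# Fifteen 1 KB segments, of which only two are distinct.
segments = [b"A" * 1024] * 10 + [b"B" * 1024] * 5
source_index, target_index = {}, {}
source_sent = source_side_backup(segments, source_index)
target_sent = target_side_backup(segments, target_index)
print("source-based bytes sent:", source_sent)
print("target-based bytes sent:", target_sent)
```

Both approaches end up storing the same two unique segments; the difference is where the fingerprinting work happens and how much data travels over the network.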

The best approach to data deduplication depends on the size of the organization and its backup needs.

  • Enterprises: Object-level differencing, also called accelerated deduplication, is a good choice for enterprise customers because it focuses on performance and scalability. It delivers the fastest restores, as well as the fastest possible backup by deduplicating data after it has been written to disk. It can scale up to increase performance simply by adding more nodes.
  • Midsize businesses and remote enterprise sites: Hash-based chunking, or dynamic deduplication, is a good choice for small and midsize businesses or large enterprises with remote sites because it focuses on compatibility and cost. It delivers a low-cost, small footprint in a format-independent solution.

There are many vendors in the market. Most vendors offer only one method, either object-level differencing or hash-based chunking. However, the two technologies bring different strengths and weaknesses, making them more advantageous in some environments than others.

Understanding Deduplication Terminology

Managers will encounter some key deduplication terms as they sort through the various deduplication options:

  • Source-based – deduplication that happens at the source, such as a server. Deduplication can be a processing-intensive operation, which consumes source resources and may impact the overall performance of the source. Deduplication at the source also may delay the backup process, causing the organization to miss the backup window. However, it greatly reduces the amount of data sent over the network.
  • Target-based – deduplication that happens at the backup target, usually a VTL or some other disk array. The deduplication processing overhead is moved to the target, freeing up the source.
  • Inline – deduplication that is handled during the backup process, often by an appliance, at a point between the source and the target. Until global deduplication, which relies on a single global index of pointers, becomes widespread, inline deduplication can be difficult to scale.
  • Post-processing – deduplication is performed after the data has arrived at the target. This form of deduplication is not directly constrained by the backup window. Any subsequent replication of the data, however, cannot be done until the post-processing is complete.
  • Low bandwidth replication – once data has been deduplicated, the substantially reduced amount of data can be replicated offsite via low-bandwidth (less costly) links for purposes of electronic vaulting or remote data protection. Low bandwidth replication becomes practical because, after the initial backup, only the changes to the data are replicated.
  • File level, single instance – an approach to deduplication that identifies files that are the same and allows only one instance of the file to be stored and backed up. It lacks the granularity of object-level deduplication and may miss considerable duplicate data, preventing it from achieving high deduplication ratios.
  • Byte level, sub-file – looks into the file to identify duplicate data at the sub-file, object, or byte level. The more granular the view of the data, the more redundant data can be identified and replaced by pointers.
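The difference in granularity between the last two terms can be illustrated with a small Python comparison. This is a sketch under stated assumptions: the function names are hypothetical, chunks are a fixed 4,096 bytes, and the "small change" overwrites one byte in place. Real products may use variable-size chunks or other techniques, and an edit that inserts bytes would shift fixed chunk boundaries.

```python
import hashlib


def stored_bytes_file_level(files):
    """Single-instance storage: only whole identical files are collapsed."""
    seen = {}
    for data in files:
        seen.setdefault(hashlib.sha256(data).hexdigest(), len(data))
    return sum(seen.values())


def stored_bytes_chunk_level(files, chunk_size=4096):
    """Sub-file deduplication over fixed-size chunks (an assumption here)."""
    seen = {}
    for data in files:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            seen.setdefault(hashlib.sha256(chunk).hexdigest(), len(chunk))
    return sum(seen.values())


# A 1 MiB file made of 256 distinct 4,096-byte chunks.
original = b"".join(i.to_bytes(4, "big") * 1024 for i in range(256))

# The "one small change": flip a single byte somewhere in the middle.
changed = bytearray(original)
changed[500_000] ^= 0xFF
changed = bytes(changed)

file_level = stored_bytes_file_level([original, changed])
chunk_level = stored_bytes_chunk_level([original, changed])
print("file level stores: ", file_level, "bytes")
print("chunk level stores:", chunk_level, "bytes")
```

File-level deduplication treats the two versions as unrelated files and stores both in full, while the chunk-level pass stores the original plus the one chunk that changed.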

Balancing Deduplication Tradeoffs

Deduplication presents a number of tradeoffs. Which tradeoff an organization makes depends on the organization, its data, and its situation.

For example, it might seem that squeezing more data into less space would mean there's more room to cram in new data, but that's not how data deduplication works. Adding more unique data doesn't take advantage of the space savings provided by pointers to redundant data. Instead, deduplication makes it possible to store more backups of the same data set for a longer time in the same amount of space. This translates into a faster recovery, especially when an older version of data is needed, because it is more likely to still be online. But it doesn't necessarily translate into freeing up room for more unique data.
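A rough model makes the point concrete. The figures below (a 1,000 GB data set, 2 percent of it changing each night, 30 nightly full backups) are hypothetical inputs chosen for illustration, not measurements, and the model assumes every changed byte is new data that cannot be deduplicated.

```python
def raw_backup_gb(dataset_gb, nights):
    """Space for `nights` full backups with no deduplication."""
    return dataset_gb * nights


def deduplicated_backup_gb(dataset_gb, nightly_change_rate, nights):
    """Space for `nights` full backups when unchanged data becomes pointers.

    The first backup stores everything; each later backup adds only the
    portion of the data set that changed since the previous night.
    """
    if nights <= 0:
        return 0.0
    changed_per_night = dataset_gb * nightly_change_rate
    return dataset_gb + changed_per_night * (nights - 1)


# Hypothetical inputs, for illustration only.
DATASET_GB = 1000
NIGHTLY_CHANGE_RATE = 0.02
NIGHTS = 30

raw = raw_backup_gb(DATASET_GB, NIGHTS)
dedup = deduplicated_backup_gb(DATASET_GB, NIGHTLY_CHANGE_RATE, NIGHTS)

# Brand-new, unique data gets no such benefit: it is stored at full size.
new_unique_gb = 1000
dedup_after_new_data = dedup + new_unique_gb

print(f"30 nights, no deduplication:   {raw:.0f} GB")
print(f"30 nights, deduplicated:       {dedup:.0f} GB")
print(f"after adding 1000 GB new data: {dedup_after_new_data:.0f} GB")
```

In this model, thirty nights of backups fit in roughly the space of one and a half raw copies, yet a second data set of entirely unique data still consumes its full size.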

Deduplication Best Practices

  • Know your data
  • Know your applications
  • Know your backup and recovery requirements
  • Know your compliance requirements and data protection mandates

Inline versus post-processing and source versus target deduplication present similar tradeoffs. Source-based and inline deduplication work well if the organization has the time and source resources to spare and scalability isn’t an issue. Post-processing and target-based deduplication usually are preferred when speed and scalability are issues.

Deduplication ratios present another area of confusion. Deduplication ratios (the ratio describing how much data goes into the deduplication process compared to how much ultimately is stored) can be calculated a number of ways with widely varying results. Approach all deduplication ratio claims with skepticism until you have performed your own calculations using your own actual data.
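As a minimal sketch of the definition above, the ratio and the equivalent space savings can be computed as follows. The function names are made up for this example, and the input figures are hypothetical; they show one way the same system can report very different ratios depending on the period measured.

```python
def dedup_ratio(amount_in, amount_stored):
    """Data entering deduplication divided by data ultimately stored."""
    if amount_in < 0 or amount_stored <= 0:
        raise ValueError("amount_in must be >= 0 and amount_stored must be > 0")
    return amount_in / amount_stored


def space_saved_percent(amount_in, amount_stored):
    """The same result expressed as a percentage of space saved."""
    if amount_in <= 0:
        raise ValueError("amount_in must be > 0")
    return 100 * (amount_in - amount_stored) / amount_in


# Hypothetical system: 1,000 GB backed up nightly, 1,580 GB stored after 30 nights.
first_night = dedup_ratio(1000, 1000)      # first full backup, nothing to dedupe yet
thirty_nights = dedup_ratio(30000, 1580)   # same system measured over a month

print(f"first night:   {first_night:.1f}:1")
print(f"thirty nights: {thirty_nights:.1f}:1")
print(f"space saved over thirty nights: {space_saved_percent(30000, 1580):.1f}%")
```

A vendor quoting the thirty-night figure and a customer measuring the first night would both be correct, which is why running the calculation on your own data and retention period matters.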

When selecting a deduplication product, it pays to understand the following:

  • How much data must be backed up
  • How frequently it changes
  • How big the backup window is
  • How many backups must remain online
  • Whether backups are replicated

There are a number of valid approaches to deduplication, and it is important to keep in mind that one size does not fit all. Managers must analyze their own information systems environment and their specific data protection needs. Only then can they find deduplication vendors and products to meet their specific situation and needs.

You can learn more by visiting www.hp.com/go/deduplication.

Alan Radding is a widely published business and technology writer and the research director at Independent Assessment, www.independentassessment.com, 617-332-4369.

