"The online business magazine at the heart of international business management news..."

Comparing Deduplication Approaches: Technology Considerations for Enterprise Environments

Sepaton | www.sepaton.com

The volume of data generated by companies today is growing explosively. More powerful computing technology and the evolution to an information-based economy are causing companies to generate more data than ever before. Backing up all of this data creates a new set of challenges. Companies typically back up the same data many times over its lifecycle. As a result, a single terabyte of new data can require 50 to 60 times that capacity to store it over its lifetime.

In addition, laws such as the Health Insurance Portability and Accountability Act (HIPAA) and the Sarbanes-Oxley Act require some types of data to be stored for many years. They also require companies to be able to retrieve that data quickly and completely upon request.

To deal with this overwhelming data growth and the related storage requirements, many companies are evaluating data deduplication technology. Data deduplication technology is software that compares data in new backup streams to data that has already been stored in order to identify and remove duplicates. For example, if only 5% of the data in a current backup stream has changed since the previous backup, a virtual tape library (VTL) with deduplication technology will store only that 5%. A record is kept of the duplicate data so that files can be reassembled for data restores.
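The 5% example can be worked through as simple arithmetic. The sketch below is illustrative only: it assumes every backup is a full backup of the same size, that the change rate is constant, and that all changed data is genuinely new. The function name and parameters are hypothetical and do not come from any product.

```python
def stored_capacity_tb(full_backup_tb, change_rate, num_backups):
    """Return (raw_tb, deduplicated_tb) after num_backups full backups.

    Assumptions (for illustration only): every backup is a full backup of
    the same size, and change_rate of the data is new in each later backup.
    """
    # Without deduplication, every full backup is stored in its entirety.
    raw_tb = full_backup_tb * num_backups
    # With deduplication, the first backup is stored whole and each later
    # backup adds only the changed fraction.
    deduplicated_tb = full_backup_tb + full_backup_tb * change_rate * (num_backups - 1)
    return raw_tb, deduplicated_tb


raw, dedup = stored_capacity_tb(full_backup_tb=1.0, change_rate=0.05, num_backups=20)
ratio = raw / dedup
print(f"raw: {raw:.2f} TB, deduplicated: {dedup:.2f} TB, ratio: {ratio:.1f}:1")
```

Under these assumptions, twenty full backups of 1 TB occupy 20 TB raw but only about 1.95 TB after deduplication. The ratio keeps improving as more backups are retained, which is why retention period matters so much to the results a given environment will see.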

Changing the Economics of Data Protection

Virtual tape libraries (VTLs) provide a level of performance and reliability that traditional physical tape systems cannot approach. VTLs enable companies to back up many times faster than tape, restore data quickly, and eliminate a variety of time-consuming manual tasks. However, without data deduplication, the cost of disk is higher than that of tape, forcing companies to prioritize data protection and reserve VTL backups for only business-critical data.

Until now, data managers had to use disk space carefully by keeping online retention times short and moving data to physical tape as quickly as possible. With data deduplication, this prioritization is not necessary. When used with hardware compression on a VTL, deduplication can deliver as much as 50:1 capacity reduction, making disk-based secondary storage and longer online data retention times cost-effective for the enterprise.

The methods used to accomplish deduplication vary widely as do the levels of capacity optimization they can provide. Some techniques are well suited to small-to-medium sized backup environments and others are optimal for enterprise-class environments. This article will describe the techniques being used today to deduplicate data on VTLs. It will summarize the backup environment and data protection objectives that each technology is best suited to address.

Understanding Your Needs

Amid the hype and hyperbole surrounding data deduplication, data managers need to keep their priorities in focus when choosing a new technology. Start the process with a clear understanding of your needs.

  • Backup Performance and Time to Protection. Be sure to understand how a data deduplication technology will affect your backup time and how quickly your data will be moved to the protection of a VTL. Understand whether this backup performance will slow down over time. Ensure that you will be able to stay within your backup window and that your data is not at added risk through the backup process.
  • Restore Performance. Backing up data efficiently is only half the job. Choose a technology based on three key characteristics of your file restore needs: how often you need to restore files; the age of the files you typically restore (e.g., how often they are more than 30 days old); and how quickly you need to complete file restores.
  • Deduplication Efficiency. The capacity reduction delivered by deduplication technologies varies widely. The way you use data also has a significant impact on deduplication efficiency: the more duplicate data you have in your backup stream, the more beneficial a deduplication technology will be in your environment. Understand what level of deduplication efficiency is realistic in your environment and whether that is sufficient to offset your data growth.
  • Risk to Data Integrity. The process of deduplication takes data apart and either deletes or never stores duplicate data. Understand the relative risk of data loss or corruption for the deduplication technology you choose.
  • Capacity Scalability. Even with highly efficient deduplication technology in place, you will eventually run out of capacity. Before choosing a technology, understand the implications of outgrowing your capacity. Will it mean maintaining numerous “silos of storage,” or a forklift upgrade to a new system?
  • Performance Scalability. Will a deduplication solution slow your backup performance? Many deduplication technologies cannot scale backup performance or deduplication processing across multiple processing nodes. As a result, you have to add multiple individually managed boxes (see capacity scalability above), or tolerate significantly slower backup times.

Approaches to Deduplication

The fundamental function of all deduplication technologies is to compare the data in a backup set to the data that has already been stored to prevent the storage of duplicates. Performing this comparison at too granular a level (comparing every bit of backup data to every bit of previously stored data) would produce excellent results, but would be too time-consuming and processing-intensive to be feasible. Comparing data at too coarse a level would be faster, but would miss a significant amount of duplicate data.

There are two general ways that deduplication technologies solve this dilemma: hash-based comparison and the ContentAware™ comparison used by SEPATON® DeltaStor® deduplication software on a virtual tape library (VTL).

Hash-Based Comparison

The hash-based approach breaks data into chunks and assigns a number (called a hash) to each chunk. It keeps a record of all of the hashes in an index. To find duplicate data, it compares the hashes of new incoming chunks to the hashes already stored in the index. If a new hash is not already in the index, its corresponding data is backed up and the hash is added to the index. If a new hash matches one in the index, the corresponding data is not backed up. Instead, a marker is stored. To restore data, it uses the markers to assemble the chunks of stored data into full files. Over time, backups are broken into more and more chunks of data that are scattered on the disk. As a result, restoring files is processing-intensive and time-consuming.
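The steps above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name, the fixed 4-byte chunk size, and the choice of SHA-256 as the hash are all assumptions made to keep the example small and readable.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, chosen only to keep the example readable


class HashDedupStore:
    """Toy hash-based deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}  # hash -> chunk bytes; each unique chunk is stored once

    def backup(self, data):
        """Store any new chunks and return the markers (hashes) for this backup."""
        markers = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # New hash: back up the chunk and add the hash to the index.
                self.index[digest] = chunk
            # New or duplicate, the backup itself keeps only a marker.
            markers.append(digest)
        return markers

    def restore(self, markers):
        """Reassemble a backup by looking up every marker in the index."""
        return b"".join(self.index[marker] for marker in markers)

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.index.values())


store = HashDedupStore()
monday = store.backup(b"AAAABBBBCCCCDDDD")
tuesday = store.backup(b"AAAABBBBXXXXDDDD")  # one chunk changed
print(store.stored_bytes())  # 20 bytes stored instead of 32
```

Note that `restore` has to look up every chunk individually. In this toy the lookups are dictionary reads, but on a real system the chunks sit at scattered disk locations, which is the restore cost described above.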

ContentAware Comparison Approach

The ContentAware approach reads the data in the backup and identifies commonalities and relationships between objects or documents (e.g., Microsoft® Word document to Word document, or Oracle® database to Oracle database) to narrow the search for duplicate data. It then examines that data at the most granular (byte) level. Before any data is deleted or space is reclaimed, it also compares the deduplicated data with the original un-deduplicated data to ensure that complete data integrity is maintained. When used with the VTL’s built-in hardware compression, it can reduce capacity by 50:1 or more. Although this approach requires slightly more disk space than hash-based approaches, it has the advantage of being able to handle any size of backup and to scale both performance and capacity to address enterprise data needs. It also delivers a significantly higher level of deduplication efficiency.
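The general idea of a byte-level comparison followed by a verification step can be sketched as follows. This is not SEPATON's algorithm, which the article does not describe in detail. It is a generic delta encoding built on Python's standard `difflib` module, and the function names and sample data are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(reference, target):
    """Describe target as copies from reference plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # bytes already in reference
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))   # bytes unique to target
        # "delete": bytes present only in reference, nothing to record
    return ops


def apply_delta(reference, ops):
    """Rebuild the target from the reference and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def deduplicate_verified(reference, target):
    """Return a delta only after proving it reproduces target byte for byte."""
    ops = make_delta(reference, target)
    if apply_delta(reference, ops) != target:
        raise ValueError("delta does not reproduce the original; keep the full copy")
    return ops  # only now is it safe to reclaim the space used by target


reference = b"quarterly report v1: revenue 100"
target = b"quarterly report v2: revenue 120"
ops = deduplicate_verified(reference, target)
literal_bytes = sum(len(op[1]) for op in ops if op[0] == "data")
print(f"{literal_bytes} of {len(target)} bytes stored as new data")
```

The design point is the order of operations: the original is discarded only after the reconstructed copy has been compared with it, so a faulty delta can never silently replace good data.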

Speed to Safety

Another distinction between deduplication technologies is whether they deduplicate a given backup set inline as part of the backup process or after the backup process completes. Inline deduplication aligns well with hash-based comparison technologies and provides a cost-effective way for small to medium-sized organizations to reduce their data center capacity needs. However, most technologies that deduplicate inline cannot scale performance to the levels needed by large enterprises. They also slow down backup performance and do not find duplicate data as efficiently as other methods.

A more efficient method begins by backing up some data (writing it to disk) and then deduplicating that data while other backup data enters the system. This concurrent process enables the VTL to back up data at wire speed and to deduplicate the data efficiently.

A key distinction among deduplication technologies is the time they require to complete the deduplication process and to recapture capacity. A SEPATON S2100-ES2 with DeltaStor deduplication can use multiple processing nodes to deduplicate the data and recapture capacity quickly and efficiently. All other post-processing deduplication technologies are limited to a single processing node to complete the deduplication process.

Forward Differencing Ensures Data Integrity, Speeds Restore Times

As described above, with hash-based technologies each new backup gets broken up into more pieces that have to be identified, compiled, and reassembled to restore. As a result, restoring recently backed up data requires significant processing and time. This restore performance gets worse over time.

In contrast, the ContentAware approach uses the most recent (newest) backup as the reference data set. Duplicates found in older backups are replaced with pointers forward to the most recent backup. In this way, restore requests can be processed instantaneously with little or no reassembly.
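A toy model makes the forward-pointer idea concrete. The sketch below is an assumption-laden illustration rather than the product's design: it compares backups in fixed, position-aligned 4-byte chunks, and the class and method names are hypothetical.

```python
CHUNK = 4  # tiny fixed chunk size, for readability only


def split(data):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


class ForwardReferenceStore:
    """Toy store that keeps the newest backup whole and older ones as deltas."""

    def __init__(self):
        self.latest = None  # newest backup, stored intact
        self.older = []     # older backups, oldest first; None = pointer forward

    def backup(self, data):
        if self.latest is not None:
            new_chunks = split(data)
            delta = []
            for i, chunk in enumerate(split(self.latest)):
                same = i < len(new_chunks) and new_chunks[i] == chunk
                # A duplicate becomes a pointer to the next newer backup.
                delta.append(None if same else chunk)
            self.older.append(delta)
        self.latest = data

    def restore(self, generation):
        """generation 0 is the newest backup, 1 the one before it, and so on."""
        if generation == 0:
            return self.latest  # no reassembly needed
        idx = len(self.older) - generation
        latest_chunks = split(self.latest)
        out = []
        for i, chunk in enumerate(self.older[idx]):
            j = idx
            while chunk is None:  # follow pointers toward the newest backup
                j += 1
                chunk = self.older[j][i] if j < len(self.older) else latest_chunks[i]
            out.append(chunk)
        return b"".join(out)


store = ForwardReferenceStore()
for night in (b"AAAABBBBCCCC", b"AAAAXXXXCCCC", b"AAAAXXXXYYYY"):
    store.backup(night)
print(store.restore(0))  # newest backup, returned as stored
```

In this model the most recent backup, which is the one most often restored, is read back exactly as it was written, while the pointer-following work falls on older and less frequently requested backups.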

Fine Tuning for Optimal Results

Most deduplication technologies are “all or nothing,” requiring you to deduplicate all of your backup data and to do so in the same way across all data types. This method is adequate for small backup environments. However, in an enterprise, being able to fine-tune deduplication to your needs, data types, and business objectives is essential. The efficiencies to be gained through deduplication depend on a number of factors, including (but not limited to):

  • The amount of duplicate data in the typical backup stream
  • The data application type (Microsoft Exchange, Oracle, etc.)
  • Data retention periods – longer retention times result in greater deduplication efficiency

The ContentAware approach enables enterprises to tune their deduplication for optimal results. Not only can you choose the data you want to deduplicate by application, server, and backup application, but you can also apply three levels of deduplication to individual backups as you choose. As a result, you can achieve the fastest, most highly efficient deduplication processing in the industry.

Management Console for Maximum Control

Whether you want to monitor the capacity reclamation process or plan for future capacity requirements, DeltaStor software provides a simple, fully integrated management console and detailed reporting interface. The management console provides online access to informative graphic displays showing the progress of all backup and capacity reclamation activities. A detailed reporting function lets you see trends, usage, and backup efficiencies for maximum control and management effectiveness.

Conclusion

At a cost per gigabyte comparable to physical tape, DeltaStor deduplication technology with a SEPATON S2100-ES2 VTL is the only solution that provides the performance, capacity, and management control enterprises need to back up, restore, and protect tens of petabytes of data annually.

As CTO for SEPATON Inc., Miklos Sandorfi is responsible for the company’s product vision and roadmap. He has an extensive background in the development of enterprise-class storage systems and has 10 granted and 10 pending patents in Fibre Channel and Disk Subsystem I/O technology.

SEPATON, S2100, and DeltaStor are registered trademarks and ContentAware is a trademark of SEPATON, Inc. Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

© 2008 SEPATON Inc. All rights reserved.

