
The volume of data generated by companies today is growing explosively. More powerful computing technology and the evolution to an information-based economy are causing companies to generate more data than ever before. Backing up all of this data creates a new set of challenges. Companies typically back up the same data many times over its lifecycle. As a result, a single terabyte of new data can require 50 to 60 times that capacity to store it over its lifetime.
In addition, laws such as the Health Insurance Portability and Accountability Act (HIPAA) and Sarbanes-Oxley require some types of data to be stored for many years. They also require companies to be able to retrieve that data quickly and completely upon request.
To deal with this overwhelming data growth and related storage requirements, many companies are evaluating the use of data deduplication technology. Data deduplication technology is software that compares data in new backup streams to data that has already been stored to identify and remove duplicates. For example, if only 5% of the data in a current backup stream has changed since the previous backup, a virtual tape library (VTL) with deduplication technology stores only that 5%. A record is kept of the duplicate data so the files can be reassembled for data restores.
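To make the 5% example concrete, the rough sketch below compares cumulative storage for repeated full backups with and without deduplication. The 1 TB backup size, the 30 retained backups, and the function names are illustrative assumptions, not figures from any product, and the model ignores compression and index overhead.

```python
def stored_without_dedup(full_size_tb: float, backups: int) -> float:
    """Every backup stores the full data set again."""
    return full_size_tb * backups


def stored_with_dedup(full_size_tb: float, backups: int, change_rate: float) -> float:
    """The first backup is stored in full; each later one adds only changed data."""
    if backups <= 0:
        return 0.0
    return full_size_tb + full_size_tb * change_rate * (backups - 1)


# Assumed scenario: 1 TB data set, 30 retained full backups, 5% change per backup.
raw = stored_without_dedup(1.0, 30)
deduped = stored_with_dedup(1.0, 30, 0.05)
print(f"without dedup: {raw:.2f} TB, with dedup: {deduped:.2f} TB")
print(f"capacity reduction: {raw / deduped:.1f}:1")
```

Under these assumptions, 30 TB of raw backups shrinks to roughly 2.45 TB. The ratio grows with the number of retained backups, which is why longer online retention benefits most from deduplication.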
Changing the Economics of Data Protection
Virtual tape libraries (VTLs) provide a level of performance and reliability that traditional physical tape systems cannot approximate. VTLs enable companies to back up many times faster than tape, restore data quickly, and eliminate a variety of time-consuming manual tasks. However, without data deduplication, the cost of disk is higher than that of tape, forcing companies to prioritize data protection and reserve VTL backups for only business-critical data.
Until now, data managers had to use disk space carefully by keeping online retention times short and moving data to physical tape as quickly as possible. With data deduplication, this prioritization is not necessary. When used with hardware compression on a VTL, deduplication can deliver as much as 50:1 capacity reduction, making disk-based secondary storage and longer online data retention times cost-effective for the enterprise.
The methods used to accomplish deduplication vary widely as do the levels of capacity optimization they can provide. Some techniques are well suited to small-to-medium sized backup environments and others are optimal for enterprise-class environments. This article will describe the techniques being used today to deduplicate data on VTLs. It will summarize the backup environment and data protection objectives that each technology is best suited to address.
Understanding Your Needs
Amid the hype and hyperbole surrounding data deduplication, data managers need to keep their priorities in focus when choosing a new technology. Start the process with a clear understanding of your needs.
Approaches to Deduplication
The fundamental function of all deduplication technologies is to compare the data in a backup set to the data that has already been stored to prevent the storage of duplicates. Performing this comparison at too granular a level – comparing every bit of backup data to every bit of previously stored data – would produce excellent results, but would be too time consuming and process-intensive to be feasible. Comparing data at too gross a level would be faster, but would miss a significant amount of duplicate data.
There are two general ways that deduplication technologies solve this dilemma: hash-based comparison and the ContentAware™ comparison used by SEPATON® DeltaStor® deduplication software on a virtual tape library (VTL).
Hash-Based Comparison
The hash-based approach breaks data into chunks and assigns a number (called a hash) to each chunk. It keeps a record of all of the hashes in an index. To find duplicate data, it compares the new incoming hashes to hashes that have already been stored in the index. If a new hash is not already in the index, its corresponding data is backed up and the hash is added to the index. If a new hash matches one in the index, the corresponding data is not backed up. Instead, a marker is stored. To restore data, it uses the markers to assemble the chunks of stored data into full files. Over time, backups are broken into more and more chunks of data that are scattered on the disk. As a result, restoring files is processing-intensive and time-consuming.
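The hash-based steps described above can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: the `HashDedupStore` class, the fixed chunk size, and the choice of SHA-256 are all assumptions made for the example, and real products often use variable-size chunking and on-disk indexes.

```python
import hashlib


class HashDedupStore:
    """Toy in-memory hash-based deduplication store (illustrative only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed chunk size
        self.index = {}               # hash -> stored chunk data
        self.backups = {}             # backup name -> ordered list of markers (hashes)

    def backup(self, name, data):
        """Store a backup and return the number of new bytes actually written."""
        markers = []
        new_bytes = 0
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # Unseen hash: store the chunk and add the hash to the index.
                self.index[digest] = chunk
                new_bytes += len(chunk)
            # Seen or not, the backup itself records only a marker.
            markers.append(digest)
        self.backups[name] = markers
        return new_bytes

    def restore(self, name):
        """Reassemble a backup by following its markers into the index."""
        return b"".join(self.index[m] for m in self.backups[name])


store = HashDedupStore(chunk_size=4)
monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBXXXXDDDD"       # only the third chunk changed
print(store.backup("monday", monday))    # 16 new bytes written
print(store.backup("tuesday", tuesday))  # 4 new bytes written
assert store.restore("tuesday") == tuesday
```

Note that `restore` has to look up every marker individually. With chunks scattered across a real disk, that lookup pattern is the source of the restore overhead described above.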
ContentAware Comparison Approach
The ContentAware approach reads the data in the backup and identifies commonalities and relationships between objects and documents (e.g., Microsoft® Word document to Word document or Oracle® database to Oracle database) to narrow the search for duplicate data. It then examines that data at the most granular (byte) level. Before any data is deleted or space is reclaimed, it also compares deduplicated data with the original un-deduplicated data to ensure complete data integrity is maintained. When used with the VTL’s built-in hardware compression, it can reduce capacity by 50:1 or more. Although this approach requires slightly more disk space than hash-based approaches, it has the advantage of being able to handle any size backup and to scale both performance and capacity to address enterprise data needs. It also delivers a significantly higher level of deduplication efficiency.
Speed to Safety
Another distinction between deduplication technologies is whether they deduplicate a given backup set inline as part of the backup process or after the backup process completes. Inline deduplication aligns well with hash-based comparison technologies and provides a cost-effective way for small to medium-sized organizations to reduce their data center capacity needs. However, most technologies that deduplicate inline cannot scale performance to the levels needed by large enterprises. They also slow down backup performance and do not find duplicate data as efficiently as other methods.
A more efficient method begins by backing up some data (writing it to the disk), and then deduplicating that data while other backup data enters the system. This concurrent process enables the VTL to back up data at wire speed and to deduplicate the data efficiently.
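The concurrent pattern described above, where data is written first and deduplicated while more data arrives, can be illustrated with a writer thread and a deduplication worker. This is only a sketch of the scheduling idea: the `run_concurrent_backup` function is hypothetical, the queue stands in for backup data already written to disk, and simple hash matching is used as a placeholder for whatever comparison method a real product applies.

```python
import hashlib
import queue
import threading


def run_concurrent_backup(streams, chunk_size=4):
    """Write backups to a landing area while a worker deduplicates them."""
    landing = queue.Queue()   # stands in for backup data already written to disk
    unique_chunks = {}        # hash -> chunk retained after deduplication
    done = object()           # sentinel: tells the worker that ingest has finished

    def ingest():
        for stream in streams:
            landing.put(stream)  # the backup is "safe" as soon as it is queued
        landing.put(done)

    def deduplicate():
        while True:
            item = landing.get()
            if item is done:
                break
            for offset in range(0, len(item), chunk_size):
                chunk = item[offset:offset + chunk_size]
                # Placeholder comparison: keep one copy per distinct hash.
                unique_chunks.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

    worker = threading.Thread(target=deduplicate)
    writer = threading.Thread(target=ingest)
    worker.start()
    writer.start()
    writer.join()
    worker.join()
    return unique_chunks


streams = [b"AAAABBBB", b"AAAACCCC", b"BBBBCCCC"]
kept = run_concurrent_backup(streams)
print(len(kept), "unique chunks kept from", sum(len(s) for s in streams), "bytes")
```

Because the writer never waits on the deduplication worker, ingest speed in this sketch is independent of how long the comparison takes. That independence is the property the concurrent approach relies on.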
A key distinction among deduplication technologies is the time they require to complete the deduplication process and to recapture capacity. A SEPATON S2100-ES2 with DeltaStor deduplication can use multiple processing nodes to deduplicate the data and recapture capacity quickly and efficiently. All other post-processing deduplication technologies are limited to a single processing node to complete the deduplication process.
Forward Differencing Ensures Data Integrity, Speeds Restore Times
As described above, with hash-based technologies each new backup gets broken up into more pieces that have to be identified, compiled, and reassembled to restore. As a result, restoring recently backed up data requires significant processing and time. This restore performance gets worse over time.
In contrast, the ContentAware approach uses the most recent (newest) backup as the reference data set. Duplicates found in older backups are replaced with pointers forward to the most recent backup. In this way, restore requests can be processed instantaneously with little or no reassembly.
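The idea of keeping the newest backup whole and pointing older backups forward can be sketched as follows. This is not SEPATON's implementation: the `ForwardReferenceStore` class is hypothetical, and it assumes fixed-size chunks compared only at matching positions, which is far cruder than byte-level differencing.

```python
class ForwardReferenceStore:
    """Toy store where older backups point forward to the newest one."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        # name -> list of ("data", bytes) or ("ref", newer_name, index) entries
        self.backups = {}
        self.newest = None  # name of the most recent backup, always kept whole

    def _chunks(self, data):
        size = self.chunk_size
        return [data[i:i + size] for i in range(0, len(data), size)]

    def backup(self, name, data):
        new_entries = [("data", chunk) for chunk in self._chunks(data)]
        if self.newest is not None:
            old_entries = self.backups[self.newest]
            for i, entry in enumerate(old_entries):
                same = i < len(new_entries) and entry[1] == new_entries[i][1]
                if entry[0] == "data" and same:
                    # Replace the duplicate with a pointer forward to the new backup.
                    old_entries[i] = ("ref", name, i)
        self.backups[name] = new_entries
        self.newest = name

    def _resolve(self, entry):
        # Follow forward pointers until literal data is reached.
        while entry[0] == "ref":
            _, target, index = entry
            entry = self.backups[target][index]
        return entry[1]

    def restore(self, name):
        return b"".join(self._resolve(e) for e in self.backups[name])


store = ForwardReferenceStore()
store.backup("monday", b"AAAABBBBCCCC")
store.backup("tuesday", b"AAAAXXXXCCCC")
store.backup("wednesday", b"AAAAXXXXDDDD")
assert store.restore("monday") == b"AAAABBBBCCCC"
# The newest backup holds only literal data, so restoring it needs no pointer chasing.
assert all(kind == "data" for kind, *_ in store.backups["wednesday"])
```

In this sketch the cost of reassembly falls on the oldest backups, which are the least likely to be restored, while the most recent backup is read straight through.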
Fine Tuning for Optimal Results
Most deduplication technologies are “all or nothing,” requiring you to deduplicate all of your backup data and to do so in the same way across all data types. This method is adequate for small backup environments. However, in an enterprise, being able to fine-tune deduplication to your needs, data types, and business objectives is essential. The efficiencies to be gained through deduplication depend on a number of factors.
The ContentAware approach enables enterprises to tune their deduplication for optimal results. You can not only choose the data you want to deduplicate by application, server, and backup application, but also apply three levels of deduplication to individual backups as you choose. As a result, you can achieve the fastest, most highly efficient deduplication processing in the industry.
Management Console for Maximum Control
Whether you want to monitor the capacity reclamation process or plan for future capacity requirements, DeltaStor software provides a simple, fully integrated management console and detailed reporting interface. The management console provides online access to informative graphic displays showing the progress of all backup and capacity reclamation activities. A detailed reporting function lets you see trends, usage, and backup efficiencies for maximum control and management effectiveness.
Conclusion
At a cost-per-gigabyte comparable to physical tape, DeltaStor deduplication technology with a SEPATON S2100-ES2 VTL is the only solution that provides the performance, capacity, and management control enterprises need to back up, restore, and protect tens of petabytes of data annually.
As CTO for SEPATON Inc., Miklos Sandorfi is responsible for the company’s product vision and roadmap. He has an extensive background in the development of enterprise-class storage systems and has 10 granted and 10 pending patents in Fibre Channel and Disk Subsystem I/O technology.
SEPATON, S2100, and DeltaStor are registered trademarks and ContentAware is a trademark of SEPATON, Inc. Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
© 2008 SEPATON Inc. All rights reserved.