
IT and data center managers are facing a crisis. The volume of data generated by most companies has grown at such an explosive rate that many data centers have simply run out of the space, power, cooling, and storage capacity to handle it. Fundamental issues of insufficient capacity are compounded by increasingly stringent regulatory requirements and business initiatives, which demand higher service levels, longer online retention times, and higher levels of data protection. For enterprise data centers, meeting these demands is particularly challenging. In these large organizations, the sheer volume and variety of data types to be protected require a level of performance, scalability, and flexibility that few technologies can deliver.
To handle this data growth, while meeting regulatory and business requirements, enterprise IT staff members have to address several technical objectives. They need to:
Instead of simply adding more and more tape libraries to solve these issues or completely changing your infrastructure to add disk as a primary target, consider using a virtual tape library (VTL) with deduplication to offset data growth. Deduplication technology is typically a software application that significantly reduces the disk capacity required for data storage, enabling you to keep more data online longer while using less floor space and consuming less power.
However, there are several different types of deduplication technology. Some are better suited to small and medium-sized environments; others are designed for enterprise-class data protection.
The basics of deduplication
There are two basic categories of data deduplication technology: hash-based deduplication and byte-level comparison deduplication. The hash-based approach runs incoming data through a hashing algorithm to create a hash, a small representation of the data that also serves as a unique identifier for it. It then compares the hash to previous hashes stored in a lookup table. If a match is found, the duplicate data is replaced with a pointer to the copy that is already stored. If no match is found, the data is stored and its hash is added to the lookup table.
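The lookup-table flow described above can be sketched in a few lines of Python. This is an illustration only, not any vendor's implementation: the fixed 4-byte chunk size, the choice of SHA-256, and the names `deduplicate`, `restore`, and `store` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical fixed chunk size, kept tiny so the example is readable


def deduplicate(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into fixed-size chunks and return one hash pointer per chunk.

    A chunk whose hash is already in `store` is not stored a second time.
    """
    pointers = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:   # no match: store the new chunk under its hash
            store[digest] = chunk
        pointers.append(digest)   # match or not, the backup records only a pointer
    return pointers


def restore(pointers: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data by following each pointer into the store."""
    return b"".join(store[p] for p in pointers)


store: dict[str, bytes] = {}
first = deduplicate(b"AAAABBBBAAAACCCC", store)
second = deduplicate(b"AAAABBBBDDDD", store)
print(len(first) + len(second), "chunks backed up,", len(store), "chunks stored")
```

In this toy run, seven chunks are backed up across the two streams but only four distinct chunks are kept on disk.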
As its name suggests, byte-level comparison deduplication is designed to look for duplicate data by comparing data at a more granular (byte) level. For efficiency, it first compares data as objects (e.g. Word document to Word document or Oracle database to Oracle database) and identifies likely redundancies. It then uses advanced pattern matching to find duplicate bytes of data.
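As a rough sketch of the byte-level idea, the Python below encodes a new version of an object as copy instructions against an earlier version plus the bytes that are genuinely new. It assumes the object-level step has already paired the two versions, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the pattern matching a real product would use; the names `encode_delta` and `decode_delta` are hypothetical.

```python
from difflib import SequenceMatcher


def encode_delta(reference: bytes, new: bytes) -> list[tuple]:
    """Describe `new` as copies from `reference` plus literal new bytes."""
    matcher = SequenceMatcher(None, reference, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))         # duplicate bytes: keep a pointer only
        elif tag in ("replace", "insert"):
            delta.append(("literal", new[j1:j2]))  # unique bytes: these must be stored
        # "delete" needs no entry: those reference bytes are simply not copied
    return delta


def decode_delta(reference: bytes, delta: list[tuple]) -> bytes:
    """Rebuild the new version from the reference copy and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


reference = b"Quarterly report: revenue up 4 percent."
new = b"Quarterly report: revenue up 7 percent!"
delta = encode_delta(reference, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "literal")
print(literal_bytes, "new bytes stored out of", len(new))
```

`SequenceMatcher` is far too slow for backup-sized data; it is used here only because it makes the copy-versus-literal distinction easy to see.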
A key distinction between deduplication technologies is whether deduplication is done inline, as part of the backup process, or after the backup is completed, in a post-process step on the VTL.
High performance: staying within backup windows
To minimize disruption to end-user productivity, enterprise data managers face the challenge of backing up data within acceptable backup timeframes, or backup windows. As data volumes increase, ever-faster performance is needed to meet this challenge. A primary consideration in choosing a deduplication technology for an enterprise is its ability to back up terabytes or petabytes of data fast enough to stay within the backup window.
Although inline deduplication technologies are adequate for meeting small to medium-size backup requirements, they are not able to scale performance or capacity to the levels needed for an enterprise environment. For example, many deduplication solutions top out at backup rates of 800 Gb/hr per appliance. At this rate, to back up 10 TB of data in an eight-hour backup window, you would need numerous, independently managed appliances. This adds significant complexity and requires you to modify your backup infrastructure and policies. As your data grows, more appliances need to be deployed and managed. This creates “silos of deduplication” and a management challenge. Overall deduplication efficiency is also reduced because the comparisons that identify duplicate data are performed only within individual devices.
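The appliance count in this example is simple arithmetic, sketched below in Python. The function name `appliances_needed` and the use of decimal units are assumptions for illustration. Because the figure is printed as "Gb/hr", the sketch works it through both as gigabytes and as gigabits per hour.

```python
import math


def appliances_needed(data_gb: float, window_hr: float, rate_gb_per_hr: float) -> int:
    """Smallest number of appliances that can ingest data_gb within the window."""
    per_appliance_gb = rate_gb_per_hr * window_hr
    return math.ceil(data_gb / per_appliance_gb)


DATA_GB = 10_000  # 10 TB, assuming decimal units (1 TB = 1,000 GB)
WINDOW_HR = 8

# Reading the rate as 800 gigabytes per hour
as_gigabytes = appliances_needed(DATA_GB, WINDOW_HR, 800)
# Reading the rate literally as 800 gigabits per hour (100 gigabytes per hour)
as_gigabits = appliances_needed(DATA_GB, WINDOW_HR, 800 / 8)

print(as_gigabytes)  # 2
print(as_gigabits)   # 13
```

Under either reading, a single appliance cannot finish the job inside the window, and every additional appliance is another independently managed deduplication silo.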
Deduplication technologies that use the post-process method can scale performance easily to handle much larger backups within a typical backup window. In addition, because they back up the full data set before removing redundant data, they allow more rigorous data integrity checking. Truly enterprise-class deduplication solutions can scale performance as high as 17 TB/hr and handle petabytes of data in a single appliance.
Realistic expectations for capacity reduction
Deduplication approaches and results vary widely among deduplication solutions, as does the time required to achieve maximum deduplication. The effectiveness of deduplication technology also depends heavily on your specific backup policies, backup application and the mix of data types you are backing up. It stands to reason that the more duplicate data in your backup streams, the more beneficial your deduplication technology is likely to be.
An enterprise-class deduplication solution should be able to reduce a typical enterprise mix of data types by 25:1, and by 50:1 when combined with standard hardware compression on the VTL. Look for vendors that will test and characterize samples of your backup data and tell you clearly, before you buy, what level of deduplication to expect from their technology.
Also be aware that the deduplication ratios many vendors claim only apply to full backups. Some deduplication technologies perform far less efficiently on incremental backups or “incrementals forever” backup scenarios, such as those performed by Tivoli Storage Manager. Read the fine print on the data sheets. Ask for references from customers that are using the same backup application, similar policies and data types as you.
Restore performance
Backing up data efficiently is only half the challenge. To be successful, you need to restore data quickly and efficiently. In fact, one of the key drivers for adopting deduplication technology is the ability to keep data on disk longer in order to simplify and accelerate restore times. Before adopting a new deduplication technology, be sure to test restore times and efficiency.
For some technologies, the gating factor in restore performance is the VTL itself. Many VTLs require you to restore data through the same port that was used to back it up in the first place, causing a bottleneck. Choose a VTL that enables you to back up and restore data through any port.
Another key distinction between deduplication technologies is the method they use to store and retrieve pointers to duplicate data. Some solutions use the first backup as the reference copy and compare all subsequent backup data to it to identify duplicates. Duplicate data is replaced with pointers to this reference copy. Over time, a particular document that has been modified and backed up repeatedly is reduced to a number of pointers. To restore these files, the deduplication software has to locate and reassemble weeks or months of pointers, a potentially time-consuming process.
In contrast, other technologies store the most recent backup as a full reference copy and replace all previously stored duplicate data with pointers to it. As a result, recently backed-up data (within the previous 30 days) can be restored with little or no reassembly from pointers.
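The difference between the two referencing schemes can be illustrated with a toy model in Python. The model is an assumption made for the sketch, not a description of any product: each backup is a list of chunk labels, and each distinct chunk is held in full in exactly one backup generation, either the oldest one that contains it or the newest. The function names are hypothetical.

```python
def chunk_homes(backups: list[list[str]], newest_wins: bool) -> dict[str, int]:
    """Map each chunk to the backup generation that holds its full copy."""
    homes: dict[str, int] = {}
    for generation, chunks in enumerate(backups):
        for chunk in chunks:
            # newest_wins=True: the latest backup containing the chunk keeps it.
            # newest_wins=False: the first backup containing the chunk keeps it.
            if newest_wins or chunk not in homes:
                homes[chunk] = generation
    return homes


def generations_read(backups: list[list[str]], homes: dict[str, int], target: int) -> int:
    """Count the distinct generations touched when restoring one backup."""
    return len({homes[chunk] for chunk in backups[target]})


# A four-chunk document backed up weekly, with one chunk modified each week.
backups = [
    ["a0", "b0", "c0", "d0"],
    ["a1", "b0", "c0", "d0"],
    ["a1", "b1", "c0", "d0"],
    ["a1", "b1", "c1", "d0"],
]
latest = len(backups) - 1
first_is_reference = chunk_homes(backups, newest_wins=False)
latest_is_reference = chunk_homes(backups, newest_wins=True)

print(generations_read(backups, first_is_reference, latest))   # 4
print(generations_read(backups, latest_is_reference, latest))  # 1
```

In this model the reassembly work does not disappear when the latest backup is the reference; it moves to the oldest backups, which are the ones least likely to be restored.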
Ensuring data integrity
Enterprise deduplication requires guaranteed data integrity. Enterprise-class solutions perform a data integrity check that compares the deduplicated data to the original data set at the byte level before any duplicate data is deleted or disk space is redeployed. This comparison needs to ensure that when deduplicated data is reconstructed, it is identical to the original backup at the byte level.
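A verify-before-delete step of this kind might look like the following Python sketch. It assumes a post-process design in which the full backup sits in a staging area until deduplication has been verified. The chunk-and-hash helper, the 8-byte chunk size, and the names `commit_backup`, `staging`, `store`, and `catalog` are all hypothetical.

```python
import hashlib


def dedupe(data: bytes, store: dict[str, bytes], size: int = 8) -> list[str]:
    """Hypothetical chunk-and-hash step that returns pointers into `store`."""
    pointers = []
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)  # keep the existing chunk if the key is known
        pointers.append(key)
    return pointers


def commit_backup(name: str, staging: dict[str, bytes],
                  store: dict[str, bytes], catalog: dict[str, list[str]]) -> bool:
    """Deduplicate a staged backup and release the original only if it verifies."""
    original = staging[name]
    pointers = dedupe(original, store)
    rebuilt = b"".join(store[p] for p in pointers)
    if rebuilt != original:   # byte-for-byte comparison with the full backup
        return False          # keep the staged copy; nothing is deleted
    catalog[name] = pointers
    del staging[name]         # space is reclaimed only after verification
    return True


staging = {"monday": b"payroll-records-payroll-records-2024"}
store: dict[str, bytes] = {}
catalog: dict[str, list[str]] = {}
print(commit_backup("monday", staging, store, catalog))
```

If the chunk store ever returns bytes that differ from what was backed up, the function returns False and the staged original stays in place.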
Enterprise-class reliability
Since the deduplication solution may be the primary source for recovery for weeks or months of data, the base platform should have the same type of reliability and availability features as those found in enterprise-class disk solutions, including:
Tuning to your environment
To meet the complex business and regulatory requirements for data protection, each enterprise has its own specific policies, procedures, and requirements. Unlike smaller organizations, where a “one size fits all” approach may suffice, enterprises need a solution that can be tuned to support their policies and procedures as well as to specific environmental requirements.
People factor
Before you trust your data to a new technology, choose a vendor with experience in the specific data protection requirements of enterprise-size organizations. To be effective, you need to work closely with a company that will help you configure a solution that best meets your needs and addresses the specific requirements of your backup applications.
A VTL with deduplication is a powerful solution for enterprises, but as with any major storage purchase, be sure to complete your due diligence. Beware of solutions that are built on off-the-shelf servers and low-end storage without enterprise-class reliability and availability features.
About the author
As CTO, Miki Sandorfi is responsible for SEPATON’s technology vision and roadmap. Prior to SEPATON, he served many years at EMC Corporation, where he drove technological advances in several product categories. Miki was instrumental in developing the Fibre Channel protocol and bringing Fibre Channel connectivity to EMC products. Miki has 10 granted and 10 pending patents in Fibre Channel and Disk Subsystem I/O technology. He has a BS in Engineering from Northeastern University and completed executive education with Babson College.