"The online business magazine at the heart of international business management news..."
25 May 2011

Adding Data De-duplication to Your Backup Environment

Quantum | www.quantum.com


The addition of data de-duplication to disk backup systems is changing the backup and recovery environment. But is it the only useful solution? Does it make other backup technologies obsolete? The answer is clearly no. Storing data on disk in a de-duplicated format and using the de-duplication advantage to replicate data between sites is a huge step forward for many organizations facing data protection problems. But these technologies are part of a comprehensive backup, retention, and archiving strategy, and it is important to understand how they fit with other technologies, including the conventional disk and tape storage of the disk-to-disk-to-tape (D2D2T) systems that many users are already deploying.

Most end users understand the general strengths of the D2D2T elements. Conventional disk systems are well suited for data with short retention periods, for data that requires frequent access, and for data that doesn’t benefit from de-duplication technology. Tape systems are well suited for data that needs long-term retention with infrequent access, data that benefits from the lowest power and cooling, and data that needs to be stored on removable media. For many, data de-duplication technology fills the gap between the two. It provides performance similar to conventional disk but extends retention periods and reduces costs dramatically. When it is used to enable replication across multiple sites, it can also provide automated disaster recovery (DR) protection that increases the number of recovery points, reduces the management of removable media, and lowers costs. The rest of this editorial looks at what de-duplication does, where it works, what its limits are, and how it adds a new and important layer to the D2D2T landscape.

Data De-duplication

The Storage Networking Industry Association (SNIA) defines data de-duplication as technology that examines “a data-set or I/O stream at the sub-file level, storing and/or sending only unique data.” Data de-duplication can use several different approaches to accomplish this data reduction (the most widespread method is variable-length, block-based de-duplication). Whatever the approach, it is extremely powerful for backup data, where most IT departments store highly redundant data sets over and over again.
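To make the variable-length, block-based idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the names `split_chunks` and `DedupStore`, the toy rolling value, and the chunk size limits are all hypothetical. Chunk boundaries are placed where the last few bytes of content match a pattern, so an insertion shifts only nearby boundaries, and each unique chunk is stored once under its SHA-256 digest.

```python
import hashlib
import random


def split_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
    """Split data into variable-length chunks at content-defined boundaries."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy rolling value: its low 10 bits depend only on the last 10 bytes.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        at_pattern = (rolling & mask) == mask
        if (size >= min_size and at_pattern) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class DedupStore:
    """Keeps each unique chunk once and a per-backup list of chunk digests."""

    def __init__(self):
        self.chunks = {}    # digest -> chunk bytes
        self.backups = {}   # backup name -> ordered list of digests

    def write(self, name, data):
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if unseen
            recipe.append(digest)
        self.backups[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.backups[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Hypothetical data: two "full backups" that differ by one small insertion.
rng = random.Random(0)
monday = bytes(rng.getrandbits(8) for _ in range(200_000))
tuesday = monday[:50_000] + b"a small edit" + monday[50_000:]

store = DedupStore()
store.write("monday", monday)
store.write("tuesday", tuesday)
print("logical bytes:", len(monday) + len(tuesday))
print("stored bytes: ", store.stored_bytes())
```

Because boundaries follow the content rather than fixed offsets, chunks after the insertion can line up again with those already stored, which is the usual argument for variable-length over fixed-length blocks.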

De-duplication can reduce disk capacity requirements by 90% or more for recurring backup operations, and it reduces the bandwidth required to replicate data by a similar factor by eliminating the need to transmit duplicate data elements. IT Managers benefit from storing more recovery points on the system and from being able to restore data faster and from more recovery points. Since de-duplicated data requires less bandwidth and less time to transfer to another datacenter, remote replication is now a viable tool for disaster recovery. Without de-duplication’s bandwidth efficiency, few IT departments had the bandwidth or the time to replicate backups between locations.
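The arithmetic behind these capacity and bandwidth claims can be sketched in a few lines of Python. The environment below (30 retained full backups of 10 TB, a 1 TB nightly backup, a 100 Mbit/s WAN link) is hypothetical and chosen only for illustration, and the function names are not from any product.

```python
def stored_tb(logical_tb, reduction):
    """Physical capacity left after removing the given fraction of the data."""
    return logical_tb * (1.0 - reduction)


def transfer_hours(size_tb, link_mbit_s):
    """Hours needed to move size_tb over a link, ignoring protocol overhead."""
    bits = size_tb * 1e12 * 8              # decimal terabytes to bits
    seconds = bits / (link_mbit_s * 1e6)
    return seconds / 3600


# Assumed figures, for illustration only.
logical = 30 * 10                          # 30 full backups of 10 TB each
print(f"disk without de-duplication: {logical} TB")
print(f"disk at 90% reduction:       {stored_tb(logical, 0.90):.0f} TB")
print(f"replication, full data:      {transfer_hours(1.0, 100):.1f} h")
print(f"replication, 90% reduction:  "
      f"{transfer_hours(stored_tb(1.0, 0.90), 100):.1f} h")
```

Under these assumptions, the nightly transfer drops from roughly 22 hours, which does not fit a daily cycle with any margin, to a little over 2 hours.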

With data de-duplication, it is becoming more common to see users storing weeks or months of daily backups on disk, replicating those datasets to off-site locations for short-term site-loss protection, and transferring only the data to tape that needs long term retention – often creating tapes only once a week or once a month. Most file restores are from local disk, but data is protected by having multiple copies available in different locations. As part of the overall tiered architecture, fewer tapes are shipped off-site for long-term retention and disaster recovery protection. With faster replication, the consolidation of data in one place is more easily accomplished, enabling the centralization of tape management. Instead of having multiple, small tape systems dispersed among a number of remote offices shipping tapes to a central datacenter, a de-duplicated disk system with remote replication capability can transfer datasets to a central system where copies can be written off to tape.

The most important strengths of de-duplicated disk are the opportunity to reduce the total cost of ownership (TCO) by replacing racks full of conventional disk arrays, and to reduce the management of removable media in distributed environments by implementing a replication strategy. It is possible to achieve a measurable reduction in power, cooling, and datacenter footprint even with a modest de-duplication factor. In addition to lowering the TCO for disk backup operations, the remote replication features also lower operating expenses by making it possible to eliminate tape systems and media at remote offices.

The Limits of Data De-duplication

Data de-duplication is a powerful technology, but not every environment or type of data can benefit from it.

Environments with high data growth rates or high rates of change typically have too much unique data from day to day to allow for any considerable reduction in the size of the backup data. For data that is almost entirely unique (encrypted data) or incompressible (digital images), there is little benefit from using de-duplication. There may also be very little benefit for data that does not have to be retained over time, or where the data that is retained consists of images with significant amounts of unique data. For example, four full backups created on four successive days will normally have a high degree of redundant data, while four full backups created on the first day of each calendar quarter will normally have much less similarity.
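The daily-versus-quarterly example can be put into numbers with a deliberately simple model: the first full backup is stored whole, and each later one adds only the fraction of data that changed since the previous backup. The change rates below (2% between successive days, 60% between quarters) are assumptions for illustration, not measurements.

```python
def reduction(num_backups, change_fraction):
    """Fraction of capacity saved when each full backup after the first
    contributes only change_fraction of its size as new unique data."""
    logical = num_backups                            # units of one full backup
    stored = 1 + (num_backups - 1) * change_fraction
    return 1 - stored / logical


daily = reduction(4, 0.02)       # assumed 2% change between successive days
quarterly = reduction(4, 0.60)   # assumed 60% change between quarters
print(f"four daily fulls:     {daily:.1%} saved")
print(f"four quarterly fulls: {quarterly:.1%} saved")
```

In this model, four retained backups can never save more than 75%, even with no change at all, so the number of backups retained matters as much as the change rate.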

Data de-duplication also inevitably introduces some processing overhead, which affects performance, and different approaches to the technology change where that overhead is taken. Source-based de-duplication (where the technology is built into software running in the primary data environment) puts the overhead on production servers, where it can slow down the backup job or the primary applications. Target-based de-duplication (where an application sends data to a target system that de-duplicates it on receipt) has more than one way of handling the overhead. In-line systems process all the data during ingest, so the overhead has the potential to slow down the backup job.

Deferred, or post-processing, approaches initially allow data to land on disk and then de-duplicate it outside the backup window: de-duplication won’t slow down the backup, but more disk is required. De-duplicating during ingest uses less disk, but conventional in-line approaches can slow backups down in some circumstances, often when they see a large amount of new data. Newer adaptive methods, an approach that Quantum pioneered, de-duplicate during ingest but also create a disk buffer. They behave like conventional in-line methods most of the time, but can adapt to faster ingest and avoid slowing down the backup, so backup windows stay short. The newest generation of de-duplication systems, of which the Quantum DXi7500 is the first, lets users pick different de-duplication methods for different backup tasks, so they can get the right combination of performance and disk utilization for their unique mix of data.
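The trade-off between in-line, post-processing, and adaptive handling can be illustrated with a toy capacity model. This is a sketch under simplifying assumptions (constant ingest and de-duplication rates, a fixed fraction of unique data), and it does not describe how any Quantum product is implemented; every function name and figure is hypothetical. Each function returns the backup duration in hours and a rough upper estimate of the disk used, in GB.

```python
def inline(data_gb, ingest_gb_h, dedup_gb_h, unique_fraction):
    """De-duplicate during ingest: the slower of the two rates sets the pace."""
    rate = min(ingest_gb_h, dedup_gb_h)
    return data_gb / rate, data_gb * unique_fraction


def post_process(data_gb, ingest_gb_h, dedup_gb_h, unique_fraction):
    """Land all data first and de-duplicate after the backup window.
    dedup_gb_h only affects work done later, so it is unused here."""
    return data_gb / ingest_gb_h, data_gb + data_gb * unique_fraction


def adaptive(data_gb, ingest_gb_h, dedup_gb_h, unique_fraction, buffer_gb):
    """De-duplicate during ingest, parking any backlog in a bounded buffer."""
    unique_gb = data_gb * unique_fraction
    backlog_rate = ingest_gb_h - dedup_gb_h
    if backlog_rate <= 0:                      # engine keeps up: pure in-line
        return data_gb / ingest_gb_h, unique_gb
    backlog_gb = backlog_rate * (data_gb / ingest_gb_h)
    if backlog_gb <= buffer_gb:                # buffer absorbs the whole burst
        return data_gb / ingest_gb_h, unique_gb + backlog_gb
    fill_hours = buffer_gb / backlog_rate      # buffer fills, then throttle
    remaining_gb = data_gb - ingest_gb_h * fill_hours
    return fill_hours + remaining_gb / dedup_gb_h, unique_gb + buffer_gb


# Assumed job: 1000 GB arriving at 500 GB/h, engine at 400 GB/h, 10% unique.
job = (1000, 500, 400, 0.10)
for label, result in [
    ("in-line", inline(*job)),
    ("post-process", post_process(*job)),
    ("adaptive, 300 GB buffer", adaptive(*job, buffer_gb=300)),
    ("adaptive, 100 GB buffer", adaptive(*job, buffer_gb=100)),
]:
    hours, disk_gb = result
    print(f"{label:24s} {hours:5.2f} h  {disk_gb:7.0f} GB")
```

In this model, a buffer large enough to hold the backlog gives the backup duration of post-processing with far less disk, while an undersized buffer falls back toward in-line behavior once it fills.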

The Ideal Environment for De-duplication

De-duplication systems are ideally suited for use in environments where the data being backed up has a reasonable amount of redundancy and is retained for at least a moderate length of time, typically several weeks. In fact, the more redundancy in the data and the longer the data is retained, the more effective the de-duplication system will be. A combination of highly redundant data with frequent, full backups maximizes the capacity savings.

Examples of environments well suited for use with de-duplication include email systems, collaboration systems (with or without single instance storage capabilities), asset management systems, virtual servers (backing up the systems, their configurations, and related data), and home directories/file sharing systems.

Companies that want to reduce or eliminate tape systems in their regional or branch offices will be able to link their de-duplicated disk systems and securely replicate data between sites, automating their business continuance/disaster recovery processes. This in turn can reduce their administrative expenses and increase the overall reliability of their backup system.

The Impact of Data De-duplication on D2D2T Approaches

De-duplicated disk-based systems have changed the rules for D2D2T backup and recovery systems. De-duplicated disk backup systems, combined with tape and a capable data management application, enable IT Managers to realize a broad range of benefits over and above the typical D2D2T systems. The combined approach lets users…

  • Meet more time sensitive service level agreements and their related RPOs and RTOs by retaining more backups on disk instead of tape
  • Reduce operating costs (space, power, cooling) by operating fewer devices with more capacity
  • Leverage faster replication to remove tape from remote sites, lowering operating and administrative expenses
  • Centralize tape operations to increase security and limit risk of data loss
  • Lower the overall TCO by reducing the number of disk systems required for backup as well as reducing the number of tapes created, managed, copied, shipped, and stored off-site

De-duplicated disk systems provide comparable performance to, and the same reliability as, conventional disk systems, with significant capacity savings (90% or more). This considerable improvement in capacity utilization allows more datasets to be retained for a longer period, providing greater restore performance to meet increasingly demanding RPOs and RTOs.

De-duplication makes remote replication faster and easier. With a de-duplicated backup, only the unique data is transferred to the centralized storage system. In the same way that capacity is optimized locally, data transfer is also optimized. With efficient replication, it is now possible to replace tape libraries in remote offices with de-duplicated disk systems and replicate data to a single data center or disaster recovery site where all tape creation tasks can be centrally administered.
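The replication step described above can be sketched as a simple exchange: the source sends the list of chunk digests for a backup, the target works out which digests it does not yet hold, and only those chunks cross the link. The Python below is a minimal illustration that uses fixed-size chunks for brevity; the `Site` class, the `replicate` function, and the 4 KB chunk size are assumptions, not a description of any product's protocol.

```python
import hashlib

CHUNK = 4096  # assumed fixed chunk size, for brevity


class Site:
    """A de-duplicated store: unique chunks plus a digest list per backup."""

    def __init__(self):
        self.chunks = {}    # digest -> chunk bytes
        self.recipes = {}   # backup name -> ordered list of digests

    def backup(self, name, data):
        recipe = []
        for off in range(0, len(data), CHUNK):
            piece = data[off:off + CHUNK]
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(digest, piece)
            recipe.append(digest)
        self.recipes[name] = recipe

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.recipes[name])


def replicate(source, target, name):
    """Copy one backup to the target, sending only chunks it lacks.
    Returns the number of chunk bytes that had to be transferred."""
    recipe = source.recipes[name]
    missing = {d for d in recipe if d not in target.chunks}
    sent = 0
    for digest in missing:
        target.chunks[digest] = source.chunks[digest]
        sent += len(source.chunks[digest])
    target.recipes[name] = list(recipe)
    return sent


# Hypothetical data: 100 distinct chunks, then the same data with one changed.
day1 = b"".join(bytes([i]) * CHUNK for i in range(100))
day2 = day1[:50 * CHUNK] + bytes([200]) * CHUNK + day1[51 * CHUNK:]

branch, central = Site(), Site()
branch.backup("day1", day1)
branch.backup("day2", day2)
sent_day1 = replicate(branch, central, "day1")
sent_day2 = replicate(branch, central, "day2")
print("bytes sent for day1:", sent_day1)
print("bytes sent for day2:", sent_day2)
```

The first replication has to move every chunk; the second moves only the single chunk that changed, yet the central site can restore both backups in full.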

Summary

D2D2T backup systems have proven their ability to provide the necessary combination of speed and reliability along with short- and long-term retention. With the addition of de-duplicated disk-based backup systems, the benefits of D2D2T have been enhanced. More effective disk capacity provides better short- to medium-term retention and reduces the amount of tape needed for long-term retention, while efficient replication becomes a viable option for a distributed enterprise.

Achieving a balance between performance and cost is a matter of implementing each technology appropriately in the data protection lifecycle. De-duplication is a key technology that is bridging the gap between conventional disk and tape, bringing better balance between accessibility and retention, while improving TCO.

