Bob Wilson, Vice President of HP’s Storage Platforms Division, explains why deduplication is creating a significant buzz and shares his thoughts on the exciting future of this emerging technology in an exclusive interview with BM.
BM. Emerging technologies like deduplication are creating a significant buzz in areas such as disk-based backup. What are the main trends and challenges that make deduplication of interest to customers?
BW. A key trend challenging customers is the data explosion. By 2010, there will be a six-fold increase in the world’s information. Look at the Web 2.0 space: it has more than 150 million users, growing by roughly a quarter of a million a day, and all of them are creating data. Organizations already face severe challenges just storing the data they have today; imagine how much more difficult and expensive it will be to store six times more data in just a few short years.
That’s why deduplication is of interest to customers – it is a very simple technique for eliminating redundancy in much of the data being stored today. A deduplication system eliminates redundant data, significantly reducing the amount of storage capacity needed to keep that data. Deduplication technology looks at data on a sub-file or block level and attempts to determine whether it has seen the data before. If it hasn’t, it stores it. If it has, it ensures that the data is stored only once, and all other references to it are merely pointers.
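To make the block-and-pointer idea concrete, here is a minimal sketch of a hash-based, block-level deduplication store (an illustration only, not HP’s implementation; the chunk size, the chunk_store dictionary and the function names are assumptions made for the example):

```python
# Minimal sketch of block-level deduplication: split data into fixed-size
# chunks, fingerprint each chunk, and store a chunk only the first time it
# is seen; later occurrences are kept as pointers (fingerprints) in a recipe.
import hashlib

CHUNK_SIZE = 4096        # assumed chunk size for the example
chunk_store = {}         # fingerprint -> chunk bytes, each stored only once

def deduplicate(data: bytes) -> list:
    """Return a 'recipe' of fingerprints; new chunks are added to chunk_store."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in chunk_store:     # not seen before: store it
            chunk_store[fingerprint] = chunk
        recipe.append(fingerprint)             # otherwise, keep only a pointer
    return recipe

def restore(recipe: list) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(chunk_store[fp] for fp in recipe)
```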
BM. What are the different types of deduplication?
BW. There are different methods of deduplicating data. The easiest way to describe them is by discussing where, when and how deduplication is done. Every method has pluses and minuses. There is no one right solution that fits everyone’s needs. As a result, you have to choose a solution with a deduplication technology that best suits your needs.
First, where does data deduplication occur? With a source-based approach, data is deduplicated at the backup source, so less data is sent across the network for backup, potentially shortening backup windows. A target-based approach deduplicates at the backup target; it is well suited to a virtual tape device and can therefore augment tape backup and speed up data retrieval processes.
Next, when does deduplication happen? In target-based implementations, data can either be backed up first and then deduplicated (post-process), or deduplication can be executed during the backup process (inline). Each method has pros and cons: post-process deduplication may result in a faster backup, but with inline deduplication the data can be replicated immediately after a backup concludes.
Finally, how is deduplication achieved? Object-level differencing products reduce data by storing only the changes that occur between multiple revisions of a file, while hash-based deduplication products locate global redundancies that may occur across all of the files within a backup repository. While vendors of byte-level differential products claim their content awareness makes them more efficient in their deduplication processes, hash-based deduplication vendors believe their content-agnostic approach allows them to use their technology outside the core backup market.
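As a rough illustration of the “how” (a sketch under stated assumptions, not any vendor’s product), object-level differencing can be thought of as keeping a baseline revision plus a delta that records only what changed between revisions, whereas the hash-based approach sketched earlier chunks everything and finds redundancy globally:

```python
# Sketch of object-level differencing: store only the regions of the new
# revision that differ from the previous one, and reference the rest.
from difflib import SequenceMatcher

def object_level_delta(previous: str, current: str) -> list:
    """Compute a delta describing 'current' in terms of 'previous'."""
    delta = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, previous, current).get_opcodes():
        if op == "equal":
            delta.append(("copy", i1, i2))            # reuse bytes from the old revision
        else:
            delta.append(("insert", current[j1:j2]))  # store only the new bytes
    return delta

def apply_delta(previous: str, delta: list) -> str:
    """Rebuild the current revision from the baseline plus the delta."""
    parts = []
    for entry in delta:
        if entry[0] == "copy":
            parts.append(previous[entry[1]:entry[2]])
        else:
            parts.append(entry[1])
    return "".join(parts)
```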
BM. Where do the different types of deduplication work best?
BW. Large enterprises have issues meeting backup windows, so any deduplication technology that could slow down the backup process is of no use to them; just as important, any deduplication technology that slows down restore times is not welcome either. Many large customers back up hundreds of terabytes per night, and their backup solution with deduplication needs to scale up to these capacities without degrading performance. Fragmenting the approach by having to use several smaller deduplication stores would also make the whole backup process harder to manage.
Midsize customers are concerned about backup windows as well, but to a lesser degree. Smaller organizations or remote offices generally need an easy approach – a dedicated, self-contained appliance – at a reasonable cost. These types of environments do not need a system that is infinitely scalable, or the price that comes with scalable capacity and performance. They need a single-engine approach that can work transparently in any of their environments.
These different priorities are what have led HP to develop two distinct approaches to data deduplication.
HP Accelerated Deduplication technology is designed for large enterprise data centers. It is the technology HP has chosen for the HP StorageWorks Virtual Library Systems. Accelerated Deduplication utilizes object-level differencing technology with a design centered on performance and scalability, and delivers the fastest possible backup performance. It leverages post-processing technology to deduplicate data as backup jobs complete, deduplicating previous backups whilst other backups are still completing. HP Accelerated Deduplication technology also delivers the fastest restores from recently backed up data, and offers highly scalable deduplication performance.
HP Dynamic Deduplication is designed for customers with smaller IT environments. It is the technology that HP uses in the HP StorageWorks D2D Backup System. HP Dynamic Deduplication utilizes hash-based chunking technology with a design centered on compatibility and cost. Independence from backup applications, systems with built-in data deduplication and flexible replication options for increased investment protection are other benefits of HP Dynamic Deduplication.
BM. Where is deduplication used today?
BW. Most vendors are deploying deduplication technology in disk-based backup systems. There has been some effective use of deduplication in NAS file servers, but usually the deduplication takes place only at the file level. In an environment where files are constantly being updated and saved, this kind of deduplication is less effective: even if only a small part of a file changes, the whole file must be saved again. With block-level deduplication, by contrast, a small change in a file requires saving only the portion of the file that changed, not the whole file.
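A small worked comparison makes that difference concrete (illustrative figures only; the 1 MB file, the single-byte edit and the 4 KB chunk size are assumptions for the example):

```python
# Change one byte of a 1 MB file and compare what each approach must store again.
import hashlib

CHUNK_SIZE = 4096
original = bytes(1024 * 1024)        # 1 MB file of zero bytes
edited = b"\x01" + original[1:]      # the same file with a single byte changed

def chunk_fingerprints(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)}

# File-level deduplication: the file fingerprints differ, so the whole file is stored again.
file_level_new = len(edited)

# Block-level deduplication: only chunks not already in the store are new.
new_chunks = chunk_fingerprints(edited) - chunk_fingerprints(original)
block_level_new = len(new_chunks) * CHUNK_SIZE

print(file_level_new)    # 1048576 bytes re-stored at the file level
print(block_level_new)   # 4096 bytes re-stored at the block level
```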
BM. What are the top three reasons why enterprises are utilizing deduplication technology?
BW. The most obvious benefit is that you are spending less money on storage capacity because you are storing less data, so there’s a cost efficiency that customers achieve. Another benefit is that more data can now be kept available and online in a virtual tape library with deduplication. Instead of keeping only a week’s worth of backups on the virtual tape library and relying on tapes for older data, you can keep several months of backup data on your virtual tape library, making it easier to restore older data. Anyone who has had to bring tapes back from an archive to restore an older file knows exactly what I’m talking about here. The last benefit I’d highlight is easier management, which helps lower the total cost of ownership. With virtual tape libraries using deduplication, there are labor savings from eliminating failed backup jobs and bad tapes, and from less tape handling.
BM. Can you outline some of the common mistakes or pitfalls that customers should be aware of as they implement virtual tape libraries with deduplication?
BW. Probably the biggest one is comparing technologies based on the deduplication ratios quoted for products in the industry. It turns out that the ratio you will actually get depends mostly on a few things: the type of data, your backup policy, how frequently data changes and how you measure the ratio. For example, PACS is short for ‘picture archiving and communication systems,’ a type of data used in X-rays and medical imaging; it has very little duplicate data. At the other end of the spectrum, databases contain a lot of redundant data; their structure means that there will be many records with empty fields or the same data in the same fields.
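As a rough illustration of how strongly the quoted ratio depends on policy and measurement (the figures below are assumptions chosen for the arithmetic, not measured results):

```python
# Illustrative arithmetic: the same technology gives very different ratios
# depending on retention policy, change rate and how the ratio is measured.
full_backup_gb = 10_000        # logical size of one full backup (assumption)
weekly_fulls_retained = 20     # number of weekly full backups kept online
daily_change_rate = 0.01       # fraction of blocks that change per day (assumption)

logical_gb = full_backup_gb * weekly_fulls_retained            # data backed up over time
physical_gb = full_backup_gb * (1 + 7 * daily_change_rate * (weekly_fulls_retained - 1))

ratio = logical_gb / physical_gb
print(f"{ratio:.1f}:1")        # about 8.6:1 here; halve the retention or double the
                               # change rate and the headline number changes substantially
```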
For larger enterprises, you also want to make sure the deduplication technology is a fully integral part of your disk-based backup system and that the deduplication engine’s performance can scale as the capacity of your systems scales. Some disk-based systems use bolt-on deduplication engines that are very inefficient at scaling as capacity and performance demands increase.
BM. Where do you see deduplication heading in the next six to 12 months?
BW. Deduplication can automate the disaster recovery process by providing the ability to perform low-bandwidth site-to-site replication at a lower cost. Because deduplication knows what data has changed at a block or byte level, replication becomes more intelligent and transfers only the changed data as opposed to the complete data set. This saves time and replication bandwidth, and is one of the most attractive propositions that deduplication offers. Customers who do not use disk-based replication across sites today will embrace low-bandwidth replication, as it enables better disaster tolerance without the need for, and the operational costs associated with, transporting data off-site on physical tape.
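A minimal sketch of that idea follows (assumed names and data structures; real products negotiate fingerprints over the network between appliances rather than sharing in-memory dictionaries):

```python
# Low-bandwidth replication between two deduplicating stores: exchange
# fingerprints first, then transfer only the chunks the target is missing.
def replicate(source_store: dict, target_store: dict) -> int:
    """Copy missing chunks from source to target; return bytes actually sent."""
    missing = set(source_store) - set(target_store)
    bytes_sent = 0
    for fingerprint in missing:
        target_store[fingerprint] = source_store[fingerprint]
        bytes_sent += len(source_store[fingerprint])
    return bytes_sent
```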