
Data de-duplication has the power to revolutionize the data management process. Business Management spoke to COPAN Systems’ Jon Mellon, NEC Corp.’s Karen Dutch, FalconStor Software’s Paul Kruschwitz and Quantum Corp.’s Janae Lee to find out how.
Janae Lee applies approximately three decades of expertise in software marketing, storage industry sales, business development and strategic planning to her role as VP of Corporate and Product Marketing at Quantum Corporation, where she oversees all outbound customer and partner relationship marketing and product line management.
Karen Dutch has over 20 years’ experience within the storage industry and has developed advanced storage solutions for open systems and mainframe platforms. As General Manager of the Advanced Storage Products Group at NEC Corp., Dutch is responsible for the strategy, marketing and sales of next generation storage products.
Paul Kruschwitz is the Director of Enterprise Solutions for FalconStor Software, the market leader in disk-based data protection solutions. Paul has 20 years’ experience in the computer industry, focusing on storage architecture and data protection. Before joining FalconStor, he held technology positions at Verizon and Honeywell.
Jon Mellon, SVP of Worldwide Marketing and Business Development at COPAN Systems Inc, has over two decades of top-level sales and management experience in the storage and information management industries with companies such as Teradata, a Division of NCR, EMC and Akamai.
BM. Data de-duplication is one of the hottest technologies in storage right now, enabling users to radically reduce their spending on hardware by removing duplicate data. What are the key business benefits of this technology?
PK. The demand for more storage space is increasing exponentially and will continue to do so. De-duplication technology is a near-term solution that will slow the rising costs. Conservative estimates show storage savings of 80-90 percent with de-duplication. Most firms start by focusing on the storage savings, but the real recurring cost savings are in the bandwidth saved when replicating the data offsite. De-duplication allows you to transfer only the unique blocks generated each day, making it cost-effective to replicate your backup data instead of manually shipping tapes offsite. The elimination of the human error factor and of the need for tape encryption also reduces risk to the company.
KD. De-duplication is a game-changing technology that enables IT to affordably store and retain more backup and archive data online to meet increasing compliance and e-discovery demands, as well as reduced backup windows and recovery time objectives. With NEC HYDRAstor’s patent-pending, in-line de-duplication, disk space can be reduced by as much as 95 percent – dramatically reducing the disk capacity needed to store weeks, months and years of backup and archive data, at a cost competitive with tape-based systems (less than $1/GB). Because less disk capacity is required, maintenance and environmental costs – such as power – are significantly reduced.
De-duplication also enables cost-effective, bandwidth-friendly replication, which simplifies disaster recovery and eliminates today’s risk of tape loss or theft. Businesses can eliminate compliance and e-discovery risks associated with using tapes – which are unwieldy to manage; can be lost, stolen or damaged; and may not be readily accessible. By making disk affordable, de-duplication improves backup, restore and archive performance, reliability and security to meet ever-increasing stringent data protection and compliance requirements.
JL. The business benefits from data de-duplication start with increasing overall data integrity and end with reducing overall data protection costs. Data de-duplication lets users reduce the amount of disk they need for backup by 90 percent or more. With reduced acquisition costs – and reduced power, space and cooling requirements – disk becomes suitable for first stage backup and restore and for retention that can easily extend to months. With data on disk, restore service levels are higher, media handling errors are reduced, and more recovery points are available on fast recovery media. What all of that really means is that data protection is improved, service is faster, and costs are reduced.
JM. The cost reductions noted in the question itself have been the major drivers toward exploring and deploying this technology. These savings come not only from reducing the number of copies of data kept, but also from a significant reduction in the amount of data that gets replicated, moved or otherwise retained for long-term reference. For the first time, an economically viable alternative to tape exists, with the collateral benefit of intelligent replication for business continuity and disaster recovery. If architected properly, the technology also helps customers keep more data online, improving accessibility and restore times without increasing the risk of data loss. The benefits to a business can be enormous, measured in millions of dollars, as compliance, regulation and competitive factors continue to make online access a requirement to reduce risk and improve customer responsiveness.
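The savings figures in this discussion appear both as percentages (80-90 percent, 95 percent) and as reduction ratios (20:1). As a rough aid to comparing them, the short Python sketch below converts between the two forms. The function names are illustrative only and do not come from any vendor's product.

```python
def savings_percent(ratio: float) -> float:
    """Percent of disk saved for a reduction ratio (pass 20 for 20:1)."""
    return 100 - 100 / ratio


def reduction_ratio(savings_pct: float) -> float:
    """Reduction ratio implied by a percent saving (pass 90 for 90 percent)."""
    return 100 / (100 - savings_pct)


# Figures quoted by the panelists, expressed both ways.
for pct in (80, 90, 95):
    print(f"{pct} percent saved is about {reduction_ratio(pct):.0f}:1")
print(f"20:1 is about {savings_percent(20):.0f} percent saved")
```

The relationship is not linear: moving from 90 to 95 percent savings doubles the ratio from 10:1 to 20:1, so small differences in quoted percentages can reflect large differences in ratio.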
BM. Naturally, companies considering de-duplication are wary of losing vital data that’s falsely deemed duplicative. Is this an issue, and how can companies implementing data de-duplication technology guard against this eventuality?
KD. Protecting de-duplicated data – or exercising safe de-duplication – cannot be sufficiently attained by RAID. First-generation VTLs, disk-as-disk backup appliances and legacy systems with RAID-based de-duplication bolted on afterward can put data at risk. With data reduction ratios exceeding 20:1, hundreds of backup or archive images may depend on a single chunk. A lost chunk could render those images unreadable. A failed disk could be even worse, easily impacting the availability of all backup and archive data. This vulnerability underscores the need for next-generation storage with data protection built specifically to enhance resiliency for de-duplicated data.
NEC’s HYDRAstor grid-based architecture offers advanced data resiliency through its patent-pending Distributed Resilient Data (DRD) technology. Developed to deliver higher levels of data resiliency and availability, DRD doesn’t have the limitations and overhead of RAID, which becomes problematic as disk capacities increase. DRD’s default resiliency level of three protects against data loss – even with three disk or node failures – offering 300 percent more protection than RAID 5 with similar storage overhead and no performance degradation if a failure occurs and disk(s) need to be rebuilt.
JM. De-duplication technologies fall into two categories: byte-level checking of every block of data, and intelligent hashing algorithms. Byte-level checking places a huge additional burden on the backup/archive process, often negating the cost benefits associated with de-dupe. Hashing algorithms deliver very high levels of security against false positive de-duplication, with a probability of data loss lower than that of multiple drive failures in a protected disk group. The result is an architecture suited to large organizations that still delivers real cost benefit.
Three key components should exist in any enterprise-class de-duplication solution. First, it must have highly available components embedded in the SAN/storage platform. Second, it must perform to the level of multiple PB customer backup needs, which will likely mean architecting the de-dupe engine as a post-process task where customers can determine what backup volumes they want to de-dupe and when. And third, tight integration between the data protection method (i.e. VTL, file, block) and the de-dupe software eases integration and future product development.
PK. Incorrectly identifying a data block as being a duplicate is statistically insignificant for established hashing algorithms. Many good articles have been written in an attempt to put the proper perspective on the ‘hash collision’ issue. De-duplication engines built on established and vetted algorithms, such as the SHA-1 hash, are many orders of magnitude more reliable than the storage and transports that we entrust our data to on a daily basis. You are more likely to see a dual RAID failure or another hardware issue that could result in a loss of data. Companies can use dual-parity RAID configurations and insist on replicating the de-duplication repository.
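To give a sense of scale for the ‘hash collision’ point, the Python sketch below applies the standard birthday-bound approximation to a 160-bit hash such as SHA-1. The 8 KB block size and 10 PB of unique data are assumptions chosen for illustration, not figures from any panelist, and the estimate treats hash outputs as uniformly random, so it covers accidental collisions only, not deliberately constructed ones.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound estimate: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Assumed workload: 10 PB of unique data stored in 8 KB blocks.
unique_bytes = 10 * 2 ** 50
block_bytes = 8 * 1024
blocks = unique_bytes // block_bytes

p = collision_probability(blocks, hash_bits=160)
print(f"{blocks:,} blocks, estimated collision probability {p:.1e}")
```

Under these assumptions the estimate is many orders of magnitude below typical hardware failure rates, which is the comparison being drawn above.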
JL. The base technology used in the mainstream data de-duplication systems was built around methodology designed with the integrity of user data as the first concern. I’m in a good position to comment on this topic because the primary patent for variable-length, block-based data de-duplication is held by Quantum Corporation – that means the developers closest to the technology are part of a company that is an industry-leader specializing in backup, recovery, and archive. Incidentally, this data security is not just theoretical – today, there are thousands of users all over the world safely protecting petabytes of data with products that rely on data de-duplication techniques.
BM. Major disasters such as Hurricane Katrina and new laws enacted specifying data retention and retrieval policies for litigation purposes are making companies wake up to the stark realities associated with their disaster recovery capabilities. What advantages can data de-duplication technologies offer in terms of disaster recovery?
JM. De-duplication delivers the economics that allows data to be kept online longer while significantly reducing the network bandwidth requirements to achieve offsite replicated copies of backup and archive data. A longer, on-line retention period allows companies to retrospectively check for data anomalies and permits re-constructive data fixes long after data has been erroneously changed or lost.
Significantly reducing the amount of data that needs to be transmitted over a network to secure replicated copies makes disk-based solutions a compelling option for backup and archive. Data can be protected in at least two locations at a reasonable cost, without incurring the security and integrity issues associated with removable media.
JL. This is a very important question. When you write backup to conventional disk, you always need to carry out another step to provide site-loss protection, and as Katrina and the recent fires in Southern California remind us, disaster recovery protection is absolutely essential for critical data. Data de-duplication really helps this issue because it reduces the bandwidth that’s needed to transmit data over networks by 90 percent or more. That happens because most backup jobs only hold a small percentage of really new data – typically less than five percent. By linking replication with de-duplication, we can transmit an entire backup set over a network, but only have to move a few new blocks. That means that replication over standard WANs is, for the first time, a practical tool for DR – and users can create remote copies of data every day without having to transport tapes.
KD. Remote data replication for disaster recovery is often unrealistic and too costly when using traditional VTLs or disk appliances because volumes of data must be moved over constrained bandwidth. With HYDRAstor’s in-line data de-duplication, only unique and new data is sent over the WAN, which offers a 95 percent reduction in transferred data, making replication over a WAN realistic and much more cost-effective. In addition, tapes can be lost or damaged easily; are not always accessible; and do not allow businesses to easily comply with new compliance and e-discovery laws. By making disk-based backup and replication affordable, businesses can replace outdated disaster recovery solutions that use tapes, and instead implement a cost-effective solution for protection in the event of a disaster, as well as meet compliance and e-discovery needs.
PK. Data de-duplication solutions that scale to both the small remote office and large data centers allow a unified solution for replicating data into a disaster recovery site. This can be done without a wholesale change in backup practices or procedures, reducing the cost and time for implementation. The elimination of shipping physical tapes avoids many risks of loss as well as the need for costly encryption solutions. In addition, with the repository on hand at both locations, recovery can begin immediately.
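The idea the panelists describe, sending only the blocks the remote site does not already hold, can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's implementation: it assumes fixed-size 4 KB blocks and SHA-256 fingerprints, whereas the products discussed here may use variable-length blocks and other hashes, and the names `replicate`, `restore` and `remote` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def split_blocks(data: bytes) -> list[bytes]:
    """Cut a backup stream into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def replicate(backup: bytes, remote_store: dict[str, bytes]) -> tuple[list[str], int]:
    """Send only blocks the remote site lacks; return (recipe, bytes sent)."""
    recipe, bytes_sent = [], 0
    for block in split_blocks(backup):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in remote_store:   # new block: must cross the WAN
            remote_store[digest] = block
            bytes_sent += len(block)
        recipe.append(digest)            # duplicates cost only a reference
    return recipe, bytes_sent


def restore(recipe: list[str], remote_store: dict[str, bytes]) -> bytes:
    """Rebuild a full backup at the remote site from its recipe."""
    return b"".join(remote_store[digest] for digest in recipe)


# Day 1: 20 distinct blocks. Day 2: the same backup with one block changed.
remote: dict[str, bytes] = {}
day1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(20))
day2 = day1[:7 * BLOCK_SIZE] + bytes([99]) * BLOCK_SIZE + day1[8 * BLOCK_SIZE:]

recipe1, sent1 = replicate(day1, remote)
recipe2, sent2 = replicate(day2, remote)
print(f"day 1 sent {sent1} bytes, day 2 sent {sent2} bytes")
```

In this toy case the second day's full backup is replicated by sending one block out of twenty, and either day's backup can still be restored in full at the remote site from its recipe.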
BM. Over the past three years, the de-duplication market has grown from nothing to $100 million and analyst firm The 451 Group expects this market to reach $260 million by the end of 2007. What do you think will be the key developments over the next 12 months? Are there any challenges that need to be overcome for the market to really take off?
JL. One of the most important needs is to have the technology become available in true enterprise-level solutions and to have it effectively integrated with the other elements needed for a comprehensive data protection strategy. De-duplication is great – but it can’t do everything. Companies also need conventional D2D backup for some jobs, they need tape for long-term retention, and they need encryption for security. And they have to match the right technology with the access and retention requirements for different data types and points in their lifecycle. We think what is really allowing de-duplication to begin entering the mainstream is the fact that vendors with mature, multiple solution sets are integrating the technology into their offerings, providing common management tools, and supporting it with an experienced, unified service organization.
PK. The key de-duplication developments over the next 12 months will include more storage intelligence to better utilize lower cost storage, improved data processing algorithms, and the ability to data mine the storage repositories. Current challenges to de-duplication technology are primarily hardware related. Scalability and performance of any de-duplication technology are dependent on the storage and server capabilities. Companies should look for solutions that cluster servers within a single repository to ensure the long-term scalability within their environment. Clustering not only allows for fault tolerant configurations, but it also allows the hardware requirements to be spread out across multiple servers while still utilizing a single repository.
JM. There are many de-duplication products available, each delivering similar data reduction performance. While de-duplication software itself will continue to be commoditized, it will be the complementary integration of the software with the storage platform – the complete solution – that drives meaningful differentiation in delivered user benefit.
Given today’s explosive data growth in the enterprise, the density, power consumption and operating costs of the platform hosting the de-duplication software will become critical selection criteria. The storing of long-term data (persistent data) has unique demands, such as quick and easy data access and retrieval, self-monitoring and healing for both data and infrastructure, extended product life (7-10 years) to eliminate frequent data migrations, floor space constraints, and power and cooling efficiencies. Scalability and flexibility will be key architectural considerations as remote office and central store data protection services are integrated.
KD. The de-duplication market explosion is due to overwhelming data growth, which will continue exponentially, and to the lack of new technologies to help customers better manage their data. In 2008, storage companies will invest more in R&D to deliver technologies that help customers reduce stored data and create improved, more cost-effective ways to manage, protect, access and scale their storage systems.
Today’s biggest issue is siloed de-duplication (where de-duplication is done in separate backup and archive storage systems). The scaling capacity of HYDRAstor (a single instance scales to 10PBs of capacity with 14,000 MB/sec throughput) supports one backup and archive system, meaning that data de-duplication is done once across the entire system for both data types. NEC will further HYDRAstor’s de-duplication capabilities through geographically distributed grid storage and by moving de-duplication into primary storage to address overall data growth and management at its source. By geographically distributing HYDRAstor nodes over multiple sites and dialing up DRD resiliency, companies can change the paradigm on disaster recovery.
Changing the nature of data protection
According to research firm the 451 Group, the emergence of data de-duplication technology is changing the fundamental nature and economics of data protection for enterprise IT organizations. In particular, it is at the forefront of a major shift in the way organizations deploy and maintain their backup infrastructure. Software that strips out the redundancies that exist in traditional data backups, aligned with low-cost disk storage systems, is for the first time allowing organizations of all sizes to store and maintain an ever-increasing amount of their backup data ‘online’. This has multiple significant advantages. In particular, it allows organizations to better meet the challenges caused by data growth, the increasing compliance and legal discovery burden, zero tolerance for data loss and application downtime, and the need to locate and recover the right data as quickly as possible.
Some of the more significant players in the storage market have been caught off guard by the emergence of de-duplication. Some have already acquired technologies to close the gap, and more will follow, while other small and large players are in the process of developing their own technology. The next 12-18 months will see the de-duplication landscape change as these players come to market.