"The online business magazine at the heart of international business management news..."

Identifying the Hidden Risks of Data Deduplication

By Karen Dutch, General Manager, Advanced Storage Products Group, NEC Corporation of America

NEC Corporation | www.hydrastor.com


Data de-duplication has gained significant industry attention. Known also as commonality factoring, non-redundant storage and duplicate data reduction, de-duplication identifies and stores only unique data. If a data string has already been stored in a system, it is referenced by a pointer rather than stored a second time.

De-duplication benefits are compelling: backup systems, by their nature, back up the same data over and over. By implementing de-duplication, IT organizations benefit from lower acquisition costs and lower total cost of ownership (TCO) while gaining the ability to cost-effectively replicate data for disaster recovery, even in bandwidth-constrained environments.

While first-generation disk-based backup products may improve backup operations, their high acquisition costs and TCO have limited their adoption. The inescapable reality is that these products require the purchase of significant amounts of expensive disk capacity: organizations wanting to store a month’s worth of backups on a virtual tape library (VTL) need to purchase a VTL with four to five times the disk capacity of their production environment.

Reducing disk usage is where data reduction technology changes the game: all the benefits of disk-based backup can be achieved with significantly less new disk capacity and at a lower cost.

Data de-duplication technologies
Even though substantial portions of data have not changed since their last backup, traditional backups continue to back up that data in full. De-duplication algorithms instead identify redundant data and store it only once. After the initial storage of a data string, all future backups containing that string refer to the original by using a data pointer.

Over time, de-duplication can achieve 20:1 data reduction, where only 5% of incoming data is actually stored: 100 TB of backups residing on only 5 TB of physical disk. De-dupe ratios vary depending upon the nature of the data and the organization’s backup strategy, with organizations performing daily full backups achieving greater reduction ratios faster than those employing incremental backups. The business value of data reduction includes the following:

  • Lower cost of acquisition – raw disk requirements are significantly reduced, and products incorporating data reduction have a significantly lower price per TB.
  • Lower total cost of ownership (TCO) – as data reduction increases over time, physical disk space doesn’t grow as rapidly.
  • Cost-effective remote replication – only unique de-duplicated data is remotely replicated; with 90-95% reduction in transferred data, replication is now realistic.
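
The arithmetic behind these ratios can be checked with a minimal sketch. The function names are illustrative only and are not part of any product.

```python
def physical_tb(logical_tb, ratio):
    """Physical capacity needed for `logical_tb` of backups at a ratio:1 reduction."""
    return logical_tb / ratio


def reduction_percent(ratio):
    """Share of incoming data that never has to be stored or replicated."""
    return (1 - 1 / ratio) * 100


for ratio in (10, 20):
    print(f"{ratio}:1 -> {physical_tb(100, ratio):.0f} TB on disk for 100 TB of backups, "
          f"{reduction_percent(ratio):.0f}% reduction")
```

A 10:1 ratio corresponds to a 90% reduction and 20:1 to 95%, which is the 90-95% range quoted for replicated data.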

Does RAID provide enough protection?
Most first-generation disk-based products use RAID for disk protection; RAID 5 can recover from a single disk failure with storage overhead of about 20%. However, if a second disk fails within the same RAID group, data within that group are unrecoverable. To improve data resiliency, some disk-based backup products use RAID 6 instead, with double parity for fault protection. Although RAID 6 can recover from two drive failures in the same RAID group, its storage overhead is 35-40%, a significant increase.
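
As a minimal illustration of single-parity protection, the sketch below uses byte-wise XOR parity over an assumed group of four data blocks plus one parity block (one parity in five, i.e. the 20% overhead mentioned above). The group size and block contents are assumptions for the example, not details of any particular array.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Assumed RAID 5 style group: four data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# One disk fails: XOR of the three surviving data blocks and the parity
# block reproduces the missing block.
lost = 2
survivors = [block for i, block in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[lost])
```

With a single parity block there is only one equation per byte position, so two missing blocks in the same group leave two unknowns and cannot be rebuilt.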

While de-duplication benefits are clear, “bolt-on” implementations in first-generation products introduce significant risks. Although RAID technologies may provide sufficient protection for non-de-duplicated environments, they are insufficient for de-duplicated environments.

The first risk is performance degradation: de-duplication and software-based compression are CPU-intensive processes, reducing performance by 40% or more with throughput reductions up to 65%. Such performance penalties largely negate the advantage of disk over tape.

However, the most critical risk associated with RAID in de-duplicated environments is the risk of lost backups and unavailable recoveries. When a single block may contain data used by hundreds of backups, all associated backups could be rendered unrecoverable if that block is lost. The impact of a failed disk is even worse, potentially impacting all backup availability on first-generation disk-based solutions.

Many vendors of RAID 6 arrays do not recommend reading or writing during recovery because the recursive recovery process results in significant performance degradation; interrupting the process could result in unrecoverable data. As with RAID 5, RAID 6 systems suffer a write penalty due to parity calculations, decreasing overall performance. The parity penalty is substantial because double parity calculations do not easily map to standard CPU hardware and require table lookups [1].

Moving beyond RAID, the ideal solution would be able to survive more than two disk failures, have no write performance penalty, not experience degraded performance during a disk rebuild, and use no more storage overhead than RAID 5. NEC has spent the last four years designing just such a solution called HYDRAstor.

[1] H. Peter Anvin, “The mathematics of RAID-6.”

HYDRAstor: next-generation disk-based data protection
HYDRAstor is a next-generation disk-based data protection solution designed from the ground up to address key backup and archive challenges without the risks and operational limitations of current point products. HYDRAstor’s unique grid architecture enables high performance scaling from 200 MB/sec to over 14,000 MB/sec aggregated throughput in a single instance.

HYDRAstor capacity also scales easily and non-disruptively from TBs to PBs. A single system can store months or years of backup and archive data for less than typical tape systems. HYDRAstor’s scalability and affordability are a result of its unique grid-based storage architecture and its patent-pending DataRedux™ and Distributed Resilient Data™ (DRD) technologies. Unlike first-generation VTLs and disk appliances, HYDRAstor is the only solution that fully addresses IT’s requirement for enterprise data resiliency in a de-duplicated environment.

Grid-based storage architecture
HYDRAstor’s grid-based storage architecture differentiates it from other disk-based backup systems. Combining HYDRAstor’s intelligent management software, DynamicStor™, with best-of-breed industry standard servers allows NEC to deliver unmatched data resiliency and virtually unlimited performance and capacity scalability. Unlike other solutions that place de-duplicated data at risk, HYDRAstor introduces advanced data resiliency through DRD.

Delivered as a turnkey solution, HYDRAstor enables users to apply additional resources when needed and without labor-intensive configuration, tuning or intervention. HYDRAstor is a fully self-managed, self-tuning, self-healing system, with performance and capacity able to be increased independently through the deployment of Accelerator Nodes and Storage Nodes.

Accelerator nodes – scaling performance
Accelerator Nodes (ANs) are industry standard servers optimized for performance (Figure 1). ANs connect to one or more backup servers and support both CIFS and NFS, with each AN capable of delivering more than 100 MB/sec throughput. ANs operate with a coherent distributed file system to spread workloads across available processors and ensure no node is a single point of failure. If one or more ANs fail, remaining ANs shoulder the aggregate workload. ANs scale non-disruptively to increase performance, with 10 ANs providing over 1000 MB/sec of aggregated throughput.

Storage nodes – scaling capacity
Storage Nodes (SNs) are also industry standard servers that have been optimized for storage capacity (Figure 2). SNs provide disk capacity for backup and archive data and are connected to Accelerator Nodes via a private network managed by DynamicStor. DynamicStor virtualizes storage within and across multiple SNs to create one logical capacity pool.

HYDRAstor’s innovative virtualization eliminates all provisioning tasks. With HYDRAstor, storage administrators do not need to create and size LUNs, volumes, or file systems. As SNs are added, HYDRAstor automatically load-balances existing data across available capacity to optimize performance and utilization.

DataRedux technology – storage efficiency
HYDRAstor’s DataRedux data reduction begins with NEC’s “chunking” technology which separates data into data chunks. If the data chunk has been previously stored, a data pointer is deployed. If the data chunk has not been stored, NEC’s unique data resiliency techniques (described later) are applied, and the chunk is stored.
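
The store-once-then-point idea can be sketched generically as follows. This is not NEC's implementation: the `DedupStore` class, the use of SHA-256 as the chunk fingerprint, and the in-memory dictionary are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy chunk store: each unique chunk is kept once, keyed by its fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def write(self, chunk):
        """Store the chunk if it is new; always return a pointer (the fingerprint)."""
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in self.chunks:
            self.chunks[fingerprint] = chunk
        return fingerprint

    def read(self, pointers):
        """Reassemble a backup from its list of pointers."""
        return b"".join(self.chunks[p] for p in pointers)


store = DedupStore()
monday = [store.write(c) for c in (b"payroll", b"orders", b"email")]
tuesday = [store.write(c) for c in (b"payroll", b"orders", b"email-v2")]

print(len(store.chunks))                      # unique chunks physically stored
print(store.read(tuesday))                    # the second backup restores in full
```

Two backups of three chunks each occupy four stored chunks here, because the two unchanged chunks are referenced by pointer the second time.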

De-duplication efficiency is driven by data-chunking algorithms: products with no chunking or with fixed-size chunking typically experience lower data reduction ratios. In the first case, changing one byte in a file results in the whole file being stored again. In the second, inserting or deleting a byte shifts every subsequent chunk boundary, so every chunk after the change would also be stored again. In contrast, HYDRAstor uses a sophisticated algorithm and variable chunk size to maximize the redundant data it finds. Only chunks with real data changes are identified as unique, and these are then physically compressed before being stored.
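
The difference between fixed-size and variable-size chunking can be demonstrated with a generic sketch. It is not HYDRAstor's algorithm, which the article does not describe; the window size, divisor, and use of CRC-32 are assumptions, and for brevity the CRC is recomputed over each window rather than maintained as a true rolling hash.

```python
import random
import zlib

WINDOW = 16    # bytes examined when deciding whether a chunk ends here (assumed)
DIVISOR = 64   # yields a boundary roughly every 64 bytes on average (assumed)


def fixed_chunks(data, size=64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """End a chunk wherever the checksum of the preceding WINDOW bytes hits a target."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"!" + original  # a single byte inserted at the front

for name, chunker in (("fixed", fixed_chunks), ("content-defined", content_defined_chunks)):
    old, new = set(chunker(original)), set(chunker(modified))
    print(f"{name}: {len(new - old)} of {len(new)} chunks must be stored again")
```

Because content-defined boundaries depend only on nearby bytes, they move together with the data after the insertion, and only the chunk or two touching the inserted byte is new. The fixed-size boundaries all land one byte off, so no chunk matches what was stored before.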

While duplicate elimination and physical compression create performance bottlenecks in first-generation products, HYDRAstor’s grid architecture distributes the load across nodes, avoiding performance degradation.

Distributed resilient data technology – recovery assurance
To address RAID risks and limitations with data reduction, NEC has introduced unique Distributed Resilient Data (DRD) technology. DRD protects data from three or more disk failures without write penalty, without performance degradation during a rebuild, without long rebuild times, and with less storage overhead than RAID.

After a unique data chunk is identified, it is broken into nine data fragments. DRD then computes parity fragments to match the desired resiliency level (the default setting is three, but users can “dial up” the level if desired), deriving them from the data itself rather than from existing parity. Unlike first-generation solutions using RAID that must read, recalculate, and then rewrite parity, HYDRAstor does not incur this substantial “parity penalty.”

DRD intelligently distributes the 12 fragments (nine data fragments + three parity fragments) across available SNs. At default resiliency, any data chunk can be recreated using any nine of its 12 fragments, so as many as three fragments can be lost without jeopardizing data integrity. In a minimum recommended HYDRAstor configuration with four SNs, data are protected against one SN failure (five data disks) or the failure of any three disk drives across multiple SNs.
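
The “any nine of 12” property is characteristic of erasure codes. The article does not say which code DRD uses, so the sketch below is only a generic illustration: it treats the nine data symbols as points on a polynomial over the integers modulo 257 and produces three parity symbols as further points on the same polynomial. The field choice and all function names are assumptions.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a valid symbol


def interpolate(points, x):
    """Evaluate at x the unique polynomial through `points` [(xi, yi), ...], modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, parity_count=3):
    """Return the data symbols followed by `parity_count` parity symbols."""
    points = list(enumerate(data))
    parity = [interpolate(points, len(data) + k) for k in range(parity_count)]
    return list(data) + parity


def recover(surviving, data_count=9):
    """Rebuild the data symbols from any `data_count` surviving (index, value) pairs."""
    points = surviving[:data_count]
    return [interpolate(points, x) for x in range(data_count)]


data = list(b"HYDRAstor")           # nine one-byte "fragments"
fragments = encode(data)            # 9 data + 3 parity = 12

# Drop every possible set of three fragments and rebuild from the other nine.
ok = all(
    recover([(i, v) for i, v in enumerate(fragments) if i not in lost]) == data
    for lost in combinations(range(12), 3)
)
print(ok)
```

Parity symbols in this toy field can take the value 256 and so need nine bits; production codes typically work in a field of exactly 256 elements to avoid that. Requires Python 3.8 or later for the modular inverse via `pow(den, -1, P)`.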

In larger HYDRAstor systems (12 or more SNs), each SN stores no more than one of the 12 fragments comprising a data chunk, with default resiliency protecting data against three or more SN failures. For each resiliency level, HYDRAstor automatically calculates the appropriate number of data fragments, creates the necessary number of parity fragments, and distributes all for maximum data resiliency.
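
The relationship between node count and tolerated node failures follows from how many fragments of one chunk share a node. A minimal sketch, assuming simple round-robin placement (the article does not describe DRD's actual placement policy):

```python
def place_fragments(fragment_count, node_count):
    """Assumed round-robin placement: fragment i goes to node i % node_count."""
    placement = {node: [] for node in range(node_count)}
    for fragment in range(fragment_count):
        placement[fragment % node_count].append(fragment)
    return placement


def tolerated_node_failures(fragment_count, parity_count, node_count):
    """Worst-case number of whole-node failures a chunk survives."""
    counts = sorted(
        (len(f) for f in place_fragments(fragment_count, node_count).values()),
        reverse=True,
    )
    lost, failures = 0, 0
    for count in counts:  # assume the fullest nodes fail first
        if lost + count > parity_count:
            break
        lost += count
        failures += 1
    return failures


for nodes in (4, 12):
    print(nodes, "SNs ->", tolerated_node_failures(12, 3, nodes), "node failure(s) tolerated")
```

With four SNs each node holds three of a chunk's 12 fragments, so one node failure uses up all three parity fragments; with 12 SNs each node holds one fragment, so three nodes can fail.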

Unlike RAID-based implementations, HYDRAstor is accessible during a disk rebuild without degraded performance for backup and restore. Because fragments are distributed throughout the grid, ongoing operations are not impacted. Failed components are automatically “discovered” and rebuilds automatically initiated in the background (Figure 4).

With its additional resilience, HYDRAstor would seem to require tremendous storage overhead, but that is not the case. Overhead for HYDRAstor’s default resiliency is 3/12 or 25%, only slightly more than RAID 5 (about 20%) and substantially less than RAID 6 (around 35-40%). At the same time, HYDRAstor provides substantially higher resiliency: it tolerates three disk failures, three times as many as RAID 5 and 50% more than RAID 6.
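
The overhead figures are simply parity capacity divided by total capacity. A short sketch, in which the 4+1 RAID 5 group size is an assumption chosen to match the “about 20%” quoted above:

```python
def overhead(data_units, parity_units):
    """Fraction of raw capacity consumed by parity."""
    return parity_units / (data_units + parity_units)


print(f"DRD default (9 data + 3 parity): {overhead(9, 3):.0%}")
print(f"RAID 5, assumed 4 data + 1 parity: {overhead(4, 1):.0%}")

# Tolerated disk failures: three for DRD at default resiliency,
# one for RAID 5, two for RAID 6.
print(f"DRD vs RAID 5: {3 / 1:.1f}x the tolerated failures")
print(f"DRD vs RAID 6: {3 / 2:.1f}x the tolerated failures")
```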

No single point of failure
First-generation products typically have multiple points of failure, with failure of a single component such as the motherboard, a RAID card, etc., possibly resulting in a whole system failure. The nearest HYDRAstor equivalent would be failure of an AN or SN, but as previously discussed, there would be no data accessibility impact due to HYDRAstor’s distributed grid architecture (Figure 4).

The fixed hash tables of first-generation de-duplication products also represent a single point of failure. If the hash tables are lost, then data become unrecoverable. HYDRAstor has no such centralized tables, and because it distributes its hash tables across Storage Nodes, even the failure of multiple SNs would not result in hashing information loss.

HYDRAstor benefits
The unique benefits of HYDRAstor can be summarized as follows:

  • Affordable – HYDRAstor’s enterprise data reduction capabilities enable disk-based backup at price points comparable to tape.
  • Enhanced resilience – HYDRAstor’s Distributed Resilient Data (DRD) technology allows organizations to safely take full advantage of de-duplication without worrying about data loss and unavailability. DRD storage overhead is comparable to RAID 5, yet it tolerates three times as many disk failures.
  • Unrestricted scalability – HYDRAstor’s Accelerator Nodes (ANs) and Storage Nodes (SNs) allow performance and capacity to be scaled independently and non-disruptively. No other product provides this simple, cost-effective infrastructure tuning ability.
  • No performance penalty – Unlike RAID-based systems, HYDRAstor does not suffer degraded performance during disk rebuild operations, the period in which data in first-generation products is at risk.
  • No single point of failure – HYDRAstor is delivered as a fully redundant, highly available, turnkey appliance with no centralized resources as single points of failure.

HYDRAstor offers the benefits of data reduction plus increased system availability not found in first-generation products. When it comes to protecting their organization’s critical data, enterprise data centers appreciate the superior capabilities of HYDRAstor.

