The use of data reduction technologies such as compression and deduplication to reduce storage costs is nothing new. Tape drives have used compression for decades to increase backup data densities on tape, while many modern deduplicating backup appliances use both compression and deduplication to shrink backup data stores. Even a select number of existing HDD-based storage arrays use data compression and deduplication to minimize data stores for large amounts of file data stored in archives or on network-attached file servers.
The challenges of using these two technologies change when they are implemented in high performance environments. Archive, backup and, to some extent, file serving environments have fairly predictable data access patterns and lots of redundant data. High performance environments replace these with applications that potentially have highly random data access patterns and data that does not deduplicate as well. Capacity reductions of production data are not as significant (perhaps in the 2-3x range) as in backup, which can achieve deduplication ratios of 8-20x or even higher.
Aggravating the situation, there is little to no tolerance for performance interruptions in the processing of production data, whether raw or deduplicated. While organizations may tolerate occasional periods of slow deduplication performance for archive, backup and file server data stores, they expect consistently high levels of application performance with no interruptions from production storage.
Yet when it comes to deduplicating data, there is a large potential for a performance hit. In high performance production environments with high data change rates and few or no periods of application inactivity, all deduplication must be done inline. This requires analyzing incoming data as it arrives: breaking it apart into smaller chunks, creating a hash of each chunk, and comparing that hash to the existing hashes in the deduplication metadata database to determine whether the chunk is unique or a duplicate.
If the array determines from its hash that a chunk of data is a duplicate, there is still a very small chance that a hash collision has occurred, meaning two different chunks of data have produced the same hash. Should the all-flash array fail to detect and appropriately handle this collision, data may be compromised.
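The chunk, hash and compare steps described above, along with one way of guarding against a hash collision, can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular array implements deduplication: the fixed 4 KiB chunk size, the choice of SHA-256, the in-memory dictionary standing in for the deduplication metadata database, and the names `InlineDedupStore`, `write` and `read` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes (hypothetical)


class InlineDedupStore:
    """Toy in-memory model of inline deduplication (illustrative only)."""

    def __init__(self, hash_fn=None):
        # SHA-256 is an assumption; the hash function is swappable for testing.
        self._hash_fn = hash_fn or (lambda data: hashlib.sha256(data).digest())
        self._index = {}    # hash -> ids of stored chunks that share the hash
        self._chunks = []   # chunk id -> stored bytes
        self.collisions = 0

    def _store_chunk(self, chunk):
        digest = self._hash_fn(chunk)
        candidates = self._index.setdefault(digest, [])
        for chunk_id in candidates:
            # A matching hash is only a hint; the byte-for-byte comparison
            # is what confirms the chunk really is a duplicate.
            if self._chunks[chunk_id] == chunk:
                return chunk_id
        if candidates:
            # Same hash, different bytes: a collision. Store the chunk as
            # unique instead of pointing at the wrong data.
            self.collisions += 1
        self._chunks.append(chunk)
        candidates.append(len(self._chunks) - 1)
        return len(self._chunks) - 1

    def write(self, data):
        """Split data into chunks and return the list of chunk ids."""
        return [
            self._store_chunk(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)
        ]

    def read(self, recipe):
        """Rebuild the original data from a list of chunk ids."""
        return b"".join(self._chunks[i] for i in recipe)

    @property
    def unique_chunks(self):
        return len(self._chunks)
```

Writing three chunks where the first and third are identical stores only two unique chunks, and `read` returns the original bytes. The byte-for-byte comparison on every hash match costs an extra read of the stored chunk, which is one reason collision handling and performance pull against each other.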
These expectations for high levels of data integrity and performance require a large amount of cache or DRAM to host the deduplication metadata. Yet all-flash storage arrays contain only fixed amounts of DRAM. This may limit the maximum amount of flash storage capacity on the array, as it makes no sense for the array to offer flash storage capacity beyond the amount of data that it can effectively deduplicate.
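A rough back-of-the-envelope calculation shows why fixed DRAM can cap capacity. The sketch below assumes a deliberately simple model in which every unique chunk needs one fixed-size metadata entry held entirely in DRAM. The 4 KiB chunk size and 40-byte entry size are hypothetical values chosen for illustration, not figures from any vendor, and real arrays may keep only part of their metadata in DRAM.

```python
def metadata_bytes(unique_data_bytes, chunk_size=4096, entry_bytes=40):
    """DRAM needed to index the given amount of unique data (assumed model)."""
    chunks = -(-unique_data_bytes // chunk_size)  # ceiling division
    return chunks * entry_bytes


def max_unique_capacity(dram_bytes, chunk_size=4096, entry_bytes=40):
    """Largest amount of unique data a given DRAM budget can index."""
    return (dram_bytes // entry_bytes) * chunk_size


TB = 10**12  # decimal terabyte

# Worst case: 100 TB of data in which every chunk is unique.
needed = metadata_bytes(100 * TB)
print(f"{needed / 10**9:.0f} GB of metadata")
```

Under these assumptions, indexing 100 TB of unique data takes about 977 GB of metadata, far more DRAM than a single controller typically holds. Larger chunks or smaller entries reduce the requirement, at the cost of lower deduplication ratios or weaker hashes.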
These all-flash array capacity limits are reflected in the results of the most recent DCIG 2014-15 Flash Memory Storage Array Buyer’s Guide. Of the 36 all-flash array models evaluated, only 42 percent could scale to 100 TB or more of flash capacity. Those models that could scale to 100 TB or more:
- Did not support the use of data deduplication at the time the Guide was published
- Did not publicly publish any performance data with deduplication turned “On”, implying that they recommend turning deduplication “Off” when hosting performance sensitive applications
- Used scale-out architectures with high node counts (up to 100) that are unlikely to be used in most production environments
The need to scale to 100 TB or more of flash storage capacity is quickly becoming a priority. HP reports that 25% of its HP 3PAR StoreServ 7450 all-flash arrays already ship with 80 TB or more of raw capacity, as its customers want to move more than just their high performance production data from HDDs to flash. They want to store all of their production data on flash. Further, turning deduplication off for any reason when hosting high performance applications on these arrays is counterintuitive, since these arrays are specifically designed and intended to host high performance applications. This is why organizations looking to acquire all-flash storage arrays to host multiple applications in their environment need to examine how well those arrays optimize both capacity and performance in order to keep costs under control.