Over the last few months I have been talking with a number of end-users about their implementations of deduplication technology. In the process they have given me valuable insight into how they are deploying disk-based targets that deduplicate data. Based on that feedback, most appear to be adhering to the following five guidelines as they implement deduplication in their environments.
6 is the magic number. I have mentioned before that 6:1 is the deduplication ratio companies need to achieve in order to financially justify deploying deduplication in their environment. Recent conversations have confirmed this, so that is nothing new. What is somewhat new is that 6:1 is also becoming the magic number for companies looking to move from backing up to raw or native disk to deduplicating the data stored on disk. Backing up data to disk started gaining traction about five to six years ago, and now even those companies are abandoning backup to raw disk in favor of deduplication.
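To see why a ratio like 6:1 can be the break-even point, consider a back-of-the-envelope comparison of cost per logical TB protected. The prices below are purely hypothetical assumptions for illustration; the break-even ratio shifts with the actual price premium of the deduplicating target over raw disk.

```python
# Hypothetical prices for illustration only; real pricing varies widely.
RAW_DISK_COST_PER_TB = 300.0            # assumed $/TB for plain backup disk
DEDUPE_TARGET_COST_PER_TB = 1800.0      # assumed $/TB for a dedupe appliance

def effective_cost_per_logical_tb(cost_per_physical_tb, dedupe_ratio):
    """Cost per TB of backup data actually protected, after deduplication."""
    return cost_per_physical_tb / dedupe_ratio

for ratio in (2, 4, 6, 8, 10):
    cost = effective_cost_per_logical_tb(DEDUPE_TARGET_COST_PER_TB, ratio)
    note = "  <- matches raw disk" if cost <= RAW_DISK_COST_PER_TB else ""
    print(f"{ratio}:1 -> ${cost:,.0f} per logical TB{note}")
```

Under these assumed numbers the appliance carries a 6x premium per physical TB, so deduplication has to reclaim 6x capacity before cost per protected TB matches raw disk.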
Analyze the data before it is backed up. Smaller environments may not care if they back up a few application servers whose data does not deduplicate particularly well. Large enterprises cannot afford to be so cavalier. With hundreds of TBs or even PBs of data under management, they are very judicious about analyzing their data before attempting to store it on a solution that deduplicates it. Two key attributes they examine before backing data up are whether a file is encrypted or compressed and whether it is an audio, image or video file. How long the data is retained after it is backed up is also taken into consideration.
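A minimal sketch of that kind of pre-backup triage might look like the following. The extension list and the compressibility threshold are illustrative assumptions, not a vendor's method: already-compressed or encrypted data looks nearly random, so a sample that barely shrinks under zlib is a reasonable hint that the file will dedupe poorly.

```python
import os
import zlib

# File types that typically dedupe poorly; illustrative, not exhaustive.
POOR_DEDUPE_EXTS = {".mp3", ".wav", ".jpg", ".jpeg", ".png", ".mp4",
                    ".avi", ".mov", ".zip", ".gz", ".7z", ".gpg"}

def looks_incompressible(path, sample_bytes=64 * 1024):
    """Heuristic: if a sample barely shrinks under zlib, the file is
    probably already compressed or encrypted and will dedupe poorly."""
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    if not sample:
        return False
    return len(zlib.compress(sample)) / len(sample) > 0.95

def worth_deduplicating(path):
    """Cheap extension check first, then a compressibility sample."""
    ext = os.path.splitext(path)[1].lower()
    if ext in POOR_DEDUPE_EXTS:
        return False
    return not looks_incompressible(path)
```

In practice this kind of check would be run across a sample of the backup set, with the poorly-deduplicating data steered to raw disk or tape, or to a shorter retention tier.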
NAS is becoming the preferred interface for deduplication targets. Three reasons are cited for using NAS instead of LUNs or a virtual tape library (VTL).
- First, the combination of deduplication and the size of today’s file systems results in deduplication targets with storage capacities that are almost infinite (hundreds of TBs or even PBs of logical capacity). This eliminates concerns about LUNs filling up and backup jobs failing, or about the backup software needing to manage VTL interfaces.
- Second, Symantec’s OpenStorage API (OST) and Data Domain Boost are delivering performance as good as or better than backing up to native disk over FC.
- Third, multiple backup jobs can be streamed concurrently to a NAS interface, which is more difficult to accomplish when backing up data to raw disk.
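The third point is easy to see in miniature: on a file interface, each backup job simply gets its own file on the share, so jobs run in parallel without carving LUNs or emulating tape drives. The sketch below is an assumption-laden toy, not a backup product; the mount path and job contents are stand-ins.

```python
import concurrent.futures
import os
import tempfile

# Stand-in for the NFS mount of the dedupe target (e.g. a path under /mnt);
# a temp directory is used here so the sketch runs anywhere.
NAS_MOUNT = tempfile.mkdtemp()

def stream_backup(job_name, source_chunks):
    """Write one backup job's data as a single sequential stream."""
    dest = os.path.join(NAS_MOUNT, f"{job_name}.bak")
    with open(dest, "wb") as out:
        for chunk in source_chunks:
            out.write(chunk)
    return dest

# Four dummy 1 MiB jobs (256 chunks of 4 KiB each).
jobs = {f"job{i}": [b"\0" * 4096] * 256 for i in range(4)}

# Each job gets its own file handle on the share, so the streams run
# concurrently with no LUN carving or tape-drive emulation required.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda kv: stream_backup(*kv), jobs.items()))
```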
If using NAS but the OST or DD Boost option is unavailable, use the NFS protocol. In one environment where neither OST nor DD Boost was an option, the user tried backups first over CIFS and then over NFS, and found that NFS gave him double the throughput he achieved using CIFS.
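A comparison like that one is simple to reproduce before committing to a protocol. A rough sketch of a write-throughput probe is below; run it once against an NFS mount and once against a CIFS mount of the same target (the mount paths are assumptions for your environment). Note that zero-filled test data will dedupe and compress trivially, so treat the result as a protocol comparison, not an absolute number.

```python
import os
import time

def measure_write_throughput(target_dir, total_mb=256, block_kb=1024):
    """Write total_mb of data to target_dir and return MB/s."""
    path = os.path.join(target_dir, "throughput_test.tmp")
    block = b"\0" * (block_kb * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(total_mb * 1024 // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())   # push data to the wire, not the page cache
    elapsed = time.monotonic() - start
    os.remove(path)
    return total_mb / elapsed

# Hypothetical mount points; adjust to your environment.
# print(measure_write_throughput("/mnt/target_nfs"))
# print(measure_write_throughput("/mnt/target_cifs"))
```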
Replication of deduplicated data is not yet as widespread as vendors may lead you to believe. Almost every, if not every, provider of deduplication solutions offers a replication option, and many users do own the feature. However, most of the users I talk with are either using replication in only a very limited deployment or are still just testing it. This is not to imply that there are no users successfully replicating on a wide scale, but I have personally not talked with many who are.