Five Guidelines as to the Best Ways to Implement Deduplication

Over the last few months I have talked with a number of end-users about their implementations of deduplication technology. Along the way they have given me valuable insight into how they implement deduplication when using disk-based targets that deduplicate data. Based on that feedback, it appears that most are adhering to the following five guidelines as they implement deduplication in their environments.

6 is the magic number. I have mentioned before that 6:1 is the deduplication ratio that companies need to achieve in order to financially justify deploying deduplication in their environment. Recent conversations have confirmed this, and it is nothing new. What is somewhat new is that 6:1 is also becoming the magic number for companies looking to move from backing up to raw or native disk to deduplicating the data stored on disk. Backing up data to disk started gaining traction about five to six years ago, and now even those companies are abandoning backup to raw disk in favor of deduplication.
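The 6:1 test is simple arithmetic: the deduplication ratio is the logical data protected divided by the physical capacity it consumes. Here is a minimal Python sketch. The function names and the sample figures are illustrative assumptions, not numbers reported by any user.

```python
def dedup_ratio(logical_tb: float, physical_tb: float) -> float:
    """Return logical data protected divided by physical capacity consumed."""
    if physical_tb <= 0:
        raise ValueError("physical_tb must be positive")
    return logical_tb / physical_tb


def meets_threshold(logical_tb: float, physical_tb: float,
                    threshold: float = 6.0) -> bool:
    """Check a measured ratio against the 6:1 guideline discussed above."""
    return dedup_ratio(logical_tb, physical_tb) >= threshold


# Hypothetical example: 120 TB of backups held in 20 TB of physical disk.
print(dedup_ratio(120, 20))       # 6.0
print(meets_threshold(120, 20))   # True
print(meets_threshold(120, 30))   # False, since the ratio is only 4:1
```

The threshold is left as a parameter because the ratio that justifies the spend will depend on each environment's costs.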

Analyze the data before it is backed up. Smaller environments may not care if they back up a few application servers with data that does not deduplicate particularly well. Large enterprises are not so cavalier. With hundreds of TBs or even PBs of data under management, they are judicious about analyzing their data before they attempt to store it on a solution that deduplicates it. Two key attributes they examine before backing up a file are whether it is encrypted or compressed and whether it is an audio, image or video file. How long the data is retained after it is backed up is also taken into consideration.
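One way to approximate this kind of pre-backup screening is to flag files by extension and by the byte entropy of a sample, since compressed and encrypted data tend to look nearly random. The Python sketch below is only an illustration. The extension list, the sample size, and the 7.5 bits-per-byte cutoff are my own assumptions, not settings any of these users described.

```python
import math
import os
from collections import Counter

# Assumed list of extensions for already-compressed, encrypted, or media files.
POOR_DEDUP_EXTENSIONS = {
    ".zip", ".gz", ".7z", ".gpg",
    ".jpg", ".jpeg", ".png", ".mp3", ".mp4", ".avi", ".mov",
}


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def likely_poor_dedup_candidate(path: str, sample_size: int = 65536,
                                entropy_threshold: float = 7.5) -> bool:
    """Heuristically flag files that are unlikely to deduplicate well."""
    ext = os.path.splitext(path)[1].lower()
    if ext in POOR_DEDUP_EXTENSIONS:
        return True
    # Unknown extension: sample the start of the file and measure entropy.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return shannon_entropy(sample) >= entropy_threshold
```

A heuristic like this can misclassify files, so it is better suited to sizing and planning than to automatically excluding data from backup.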

NAS is becoming the preferred interface of deduplication targets. Users cite three reasons for using NAS instead of LUNs or a virtual tape library (VTL).

  • First, the combination of deduplication and the size of today’s file systems results in deduplication targets with storage capacities that are almost infinite in size (hundreds of TBs or even PBs of logical capacity). This eliminates concerns about LUNs filling up and backup jobs failing, or about the backup software needing to manage VTL interfaces.
  • Second, Symantec’s Open Storage API (OST) and Data Domain Boost are delivering performance as good as or better than backing up to native disk over FC.
  • Third, multiple concurrent backup jobs can be streamed to a NAS interface at the same time, which is more difficult to accomplish when backing up data to raw disk.

If using NAS but the OST or DD Boost option is unavailable, use the NFS protocol. In one environment where neither OST nor DD Boost was an option, the user first tried backups using CIFS and then using NFS. He found that NFS delivered double the throughput he had achieved with CIFS.
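A comparison like this is easy to repeat in your own environment before committing to a protocol. The Python sketch below times a sequential write to a directory and reports MB/s. It assumes the same deduplication target is mounted once over NFS and once over CIFS at paths you supply, and the function name and default sizes are hypothetical. A real backup application will behave differently from a single sequential writer, so treat the result as a rough indication only.

```python
import os
import time


def measure_write_throughput(directory: str, total_mb: int = 64,
                             chunk_mb: int = 1) -> float:
    """Write random data to a file in directory and return throughput in MB/s."""
    n_chunks = total_mb // chunk_mb
    # Generate distinct random chunks up front so that generation time is not
    # measured and inline deduplication cannot collapse repeated data.
    chunks = [os.urandom(chunk_mb * 1024 * 1024) for _ in range(n_chunks)]
    path = os.path.join(directory, "throughput_test.bin")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the share
    elapsed = time.perf_counter() - start
    os.remove(path)
    return (n_chunks * chunk_mb) / elapsed


# Hypothetical usage, with mount points that are assumptions:
# nfs_rate = measure_write_throughput("/mnt/dedup_nfs")
# cifs_rate = measure_write_throughput("/mnt/dedup_cifs")
```

Running it several times against each mount and comparing the averages helps smooth out caching and network noise.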

Replication of deduplicated data is not yet as widespread as vendors may lead you to believe. Almost every provider of deduplication solutions, if not every one, offers a replication option and, admittedly, many users own that feature. However, many of the users I talk with are either using replication in only a very limited deployment or are just testing it. This is not to imply that no users are successfully using replication on a wide scale, but I have personally not talked to very many who are.

Jerome M. Wendt

About Jerome M. Wendt

Jerome Wendt is the President and Founder of DCIG, LLC, an independent storage analyst and consulting firm. Mr. Wendt founded the company in November 2007.

One Comment

  • Great article, Jerome! I would like to add a few points.
    1. Security: You don’t want to compromise on security when you are thinking about implementing deduplication. The most critical systems with confidential data need to use secure data streams. Target-based deduplication solutions are not suitable in this use case, as encrypted streams do not dedupe well at the target.
    2. Replication is recommended in deduplication solutions. With deduplication, you are keeping all duplicate content in one basket. Any corruption will affect more than one backup. It is recommended to keep an additional copy onsite and/or offsite by making use of replication.
    Disclaimer: I work for Symantec; however things I post in non-Symantec portals are my own views.
    Warm regards,
