Anyone close to backup recognizes that some types of data deduplicate better than others. Translating that understanding into meaningful backup policies, however, has been nearly impossible because it is both complicated and time consuming to implement successfully. The new Sepaton VirtuoSO platform can choose the best form of deduplication for each backup stream on the fly. In this third part of my interview series with Sepaton’s Director of Product Management, Peter Quirk, we discuss how the VirtuoSO platform detects the nature of incoming backup data and then automatically invokes the best method to deduplicate it.
Jerome: Can you elaborate on how the VirtuoSO platform evaluates incoming application data and then selects the best method in which to deduplicate it?
Peter: The default behavior of the system is to attempt inline deduplication on incoming data, unless the data type is flagged in our policy engine to skip inline deduplication.
As the inline engine samples the first inrush of data from the data source, it looks at the success rate of finding hashes for the incoming chunks of data in the hash dictionary. If it is getting a reasonable hit rate, it concludes that the deduplication method it is using is appropriate for continued inline deduplication. If it gets zero matches, it might say, “Hmm, this data is either so unique that I have never seen it before, or it is not well suited to inline deduplication. This is data that would be better deduplicated using post-process.” In this case it will defer the deduplication to the post-process phase.
Another thing that happens is it may be ingesting data and everything is looking good. But then the inline engine detects that there is a region of data within the backup that is truly unique and does not hash well against existing data. It will mark that range for post-processing, simply bypass it, and leave that data for post-process deduplication.
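As a rough illustration of the sampling heuristic Peter describes, the sketch below hashes the first few chunks of a stream against a hash dictionary and defers to post-process when the hit rate is low. The chunk size, sample count, and threshold are invented for illustration; this is not Sepaton's actual implementation.

```python
import hashlib

CHUNK_SIZE = 4096        # fixed chunk size (illustrative assumption)
SAMPLE_CHUNKS = 8        # chunks sampled from the first inrush
MIN_HIT_RATE = 0.25      # below this, defer to post-process (assumed threshold)

def chunk_hashes(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def choose_dedupe_mode(stream, dictionary):
    """Sample the first chunks of a stream against the hash dictionary
    and pick inline vs. post-process deduplication."""
    sample = chunk_hashes(stream)[:SAMPLE_CHUNKS]
    hits = sum(1 for h in sample if h in dictionary)
    hit_rate = hits / len(sample) if sample else 0.0
    mode = "inline" if hit_rate >= MIN_HIT_RATE else "post-process"
    return mode, hit_rate
```

A stream whose early chunks mostly hit the dictionary stays inline; a stream of never-before-seen data falls through to post-process, mirroring the decision Peter outlines.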
Now there is a whole class of data that you know will be better deduplicated in post-process mode. So we can set those up at the policy engine to say, “Do not attempt to inline deduplicate that data. Always post-process it.”
This has two advantages. One, if you bypass inline deduplication, you can provide very consistent ingest rates. At this point, you are just writing the data to the back end in a very predictable fashion. Inline deduplication on anyone’s system has a variable ingest rate.
As long as the data change rate is low, you are writing relatively little data to the back end. But as the change rate in the data increases or the system gets data types that it has never seen before, it has to write a lot more IO to the back end, and the ingest rate drops. If your goal is to achieve a constant, predictable ingest rate, you might select post-process as the preferred deduplication approach.
Another reason to use post-process deduplication is if you are doing multi-streamed, multiplexed Oracle RMAN backups. They do not work particularly well with inline deduplication. You are better off deferring that data to the post-process engine where we can do byte level deduplication, which will be more efficient than any method of inline deduplication and really reduces the size of the data on disk.
The last case is turning deduplication off altogether. This applies to situations where you have pathological data such as encrypted data that might have come from an Oracle Secure Backup. You need to protect it or get it off of its primary storage onto another medium for technological diversity, or even geographic diversity by means of replication. But you know you cannot deduplicate it, as deduplicating it will not save any space. We can handle that workload by designating that data type so it does not deduplicate at all.
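The three policy outcomes Peter describes (inline by default, forced post-process, and no deduplication at all) could be modeled as a simple lookup table. This is only an illustrative sketch; the data-type names, mode strings, and `policy_for` helper are my own assumptions, not Sepaton's policy-engine syntax.

```python
# Illustrative policy table; type names and modes are hypothetical,
# not Sepaton's actual policy-engine configuration.
DEDUPE_POLICY = {
    "oracle_rman_multiplexed": "post-process",  # byte-level post-process dedupes better
    "encrypted":               "none",          # encrypted data will not deduplicate
}

def policy_for(data_type):
    """Look up the deduplication mode for a data type; the default is inline."""
    return DEDUPE_POLICY.get(data_type, "inline")
```

Anything not named in the table falls back to the default inline attempt, matching the platform's default behavior.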
Jerome: You mentioned the term “reasonable” in the context of a “hit rate.” Can you elaborate on how your system arrives at a “reasonable hit rate”?
Peter: An example use case might be backing up a set of home directories which come through the inline deduplication engine. They deduplicate very nicely every night. But then one day a system administrator adds a bunch of users to the home directory hierarchy whose names sort early in the alphabet.
The next time I do a backup, I have effectively prepended a whole set of new data to the backup that was not there in the prior backups. That prepend pushes all of the data down from a matching point of view. It is also highly likely that this data does not lie on hash boundaries which will result in a lower deduplication ratio than I might expect on that next backup.
This can trigger our post-process deduplication engine to look for matches because it uses a sliding window. This sliding window can find the match between today’s backup and yesterday’s backup in the files that existed in both of them. It is very effective at doing that.
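To illustrate why a sliding window recovers matches that fixed-offset comparison misses after a prepend, here is a toy sketch. The window size and helper names are my own, and a real engine would use rolling hashes rather than an exhaustive set of windows; this only demonstrates the principle.

```python
def fixed_chunk_matches(old, new, size=64):
    """Compare same-offset chunks; a prepend shifts everything, so matches vanish."""
    n = min(len(old), len(new)) // size
    return sum(old[i*size:(i+1)*size] == new[i*size:(i+1)*size] for i in range(n))

def sliding_matches(old, new, window=64):
    """Slide a window over the new stream, matching regions anywhere in old."""
    old_windows = {old[i:i+window] for i in range(len(old) - window + 1)}
    return sum(1 for i in range(0, len(new) - window + 1, window)
               if new[i:i+window] in old_windows)
```

Prepending even a small amount of new data defeats the fixed-offset comparison entirely, while the sliding search still finds the unchanged regions, which is the behavior Peter attributes to the post-process engine.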
This is a classic case where the VirtuoSO platform automatically invokes post-process in regions of a backup that did not produce particularly good deduplication ratios, or none at all; when it detects no similarities, the deduplication ratio is one. When we invoke post-processing, however, it checks to see if it can find similarities between this and other data.
In part I of this interview series, we discuss how databases and virtual machines (VMs) are just beginning to take full advantage of the benefits that disk offers as a backup target.
In part II of this interview series, we discuss what features Sepaton brought forward from its existing S2100 product line and what new features its VirtuoSO platform introduced.
In part IV of this interview series, we discuss the challenges of backing up Oracle environments and what new options the VirtuoSO platform offers to simplify and ease those challenges.