
The SSD Garbage Collection Problem Explained In Depth by WhipTail CTO James Candelaria – Part 1

DCIG’s blog has been missing one component for the last year or so: interviews with executives of data storage companies that DCIG considers thought leaders in their respective spaces. Today DCIG begins to rectify that by publishing the first part of an interview that I recently did with James Candelaria, Chief Technology Officer of WhipTail Technologies, an emerging provider of SSD storage solutions. In this and subsequent blog entries, I will publish excerpts from my interview with James.

In this first blog entry, Candelaria and I discuss one of the biggest problems precluding the widespread adoption of SSDs by enterprises: how to minimize, and ideally avoid, the massive performance penalty associated with repacking data on SSDs, better known as garbage collection. Candelaria describes the issue in great detail.

Ben: James, thanks for taking time to speak with me today about this issue. I know because of WhipTail’s position in the SSD space, you are intimately acquainted with this particular SSD issue. So to kick off our conversation and for the benefit of DCIG’s readers, could you describe in detail this SSD challenge and what steps WhipTail has taken to address it?

James: Ben, thank you and I would be happy to do so. The issue with SSDs has to do with optimizing the placement and storage of data after data has already been written to a block on an SSD.

This has to be done at the controller level, which requires that the controller repack the contents of a block, which is a set of cells that have to be erased together. To repack the contents of that block, the contents must first be put into a buffer, the change made, the entire block erased, and then the changes pushed all the way back down to the cells. This incurs a massive performance penalty.
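The repack cycle described above can be sketched as a toy model in Python. This is only an illustration, not WhipTail's or any drive vendor's firmware: the `EraseBlock` class, the `overwrite_in_place` helper, and the page and block sizes are all assumptions chosen for the example.

```python
# Hypothetical model of a flash erase block; the sizes are assumptions.
PAGE_SIZE = 4096          # bytes per page (assumed)
PAGES_PER_BLOCK = 512     # 512 * 4 KiB = 2 MiB erase block (assumed)


class EraseBlock:
    """A set of pages that can only be erased together."""

    def __init__(self):
        # Start with every page already holding data.
        self.pages = [bytes(PAGE_SIZE) for _ in range(PAGES_PER_BLOCK)]
        self.erase_count = 0
        self.pages_programmed = 0

    def erase(self):
        """Erase the whole block; individual pages cannot be erased."""
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1

    def program(self, index, data):
        """Write one page; only legal if the page is in the erased state."""
        if self.pages[index] is not None:
            raise ValueError("page must be erased before it is programmed")
        self.pages[index] = data
        self.pages_programmed += 1


def overwrite_in_place(block, index, data):
    """Naive update of one page: buffer, modify, erase, write everything back."""
    buffer = list(block.pages)      # 1. copy the block contents into a buffer
    buffer[index] = data            # 2. apply the change in the buffer
    block.erase()                   # 3. erase the entire block
    for i, page in enumerate(buffer):
        block.program(i, page)      # 4. push every page back to flash


block = EraseBlock()
overwrite_in_place(block, 7, b"\xff" * PAGE_SIZE)
print(block.erase_count, block.pages_programmed)  # prints "1 512"
```

In this model a single 4K change costs one block erase and 512 page programs, which is the penalty the rest of the discussion is about avoiding.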

To avoid this performance penalty, storage providers currently implement overprovisioning: spare capacity that you can’t see and that is already pre-erased. So in essence today’s SSD providers try to hide this activity from you, the end user.

But eventually, if you use up every cell on an SSD, all of the cells (including the spares) will be dirty. At this point the SSD ends up going into garbage collection mode to figure out which pieces are still valid and which are not, and then moves data around.

This again incurs a massive performance penalty as well as an endurance problem. These drives have to do this garbage collection on their own in the face of small random write requests.

So what WhipTail does differently is intercept all incoming data. To do so, WhipTail implements a log-structured block translation layer. This layer intercepts all IO regardless of its size and places it in a small buffer that is precisely the size of the erase block of the array underneath it.

So if WhipTail has a 24-drive array and, let’s just say for argument’s sake, each erase block in each drive is 2 MB, our write block size will end up being roughly 48 MB. Then we have to subtract some for RAID (WhipTail runs parity RAID) and then subtract some for a hot spare. So we eventually end up with write blocks of between 44 and 48 MB on a big array.
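The arithmetic behind those figures can be worked through in a few lines of Python. The function name `write_block_mb` is hypothetical, and the choice of exactly one parity drive and one hot spare is an assumption; the interview only says that "some" capacity is subtracted for each.

```python
def write_block_mb(drives, erase_block_mb, parity_drives=0, hot_spares=0):
    """Size of one array-wide write block, counting only data-bearing drives."""
    data_drives = drives - parity_drives - hot_spares
    if data_drives <= 0:
        raise ValueError("no drives left to hold data")
    return data_drives * erase_block_mb


# Upper bound: all 24 drives contribute a 2 MB erase block.
print(write_block_mb(24, 2))                               # prints 48

# Assumed layout: one drive's worth of parity and one hot spare.
print(write_block_mb(24, 2, parity_drives=1, hot_spares=1))  # prints 44
```

Under those assumptions the result lands at the two ends of the 44 to 48 MB range quoted above.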

That buffer will get ejected from our stack all at once, which basically gives WhipTail two advantages. First, WhipTail never write amplifies. In other words, WhipTail never takes a 4K write IO and submits it to media where the media ends up having to perform that wasteful operation I just mentioned.

By avoiding this action, WhipTail reduces the number of times that a small random write IO will actually cause flash erasure, which immediately enhances SSD cell endurance. It also enhances random write performance, because random write performance is governed by how quickly you can actually write to the media. In this case WhipTail is writing data to the buffer, so it can acknowledge the write as soon as the data has been staged there, and then all of that data goes to media in one large chunk.
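The buffering behavior described here can be sketched in Python as a toy log-structured layer. It is not WhipTail's implementation: the class `LogStructuredBuffer`, its `flush` callback, and the mapping layout are hypothetical, and the write block is shrunk to 64 KiB (instead of the tens of megabytes mentioned above) to keep the example small.

```python
class LogStructuredBuffer:
    """Toy log-structured translation layer (all names are hypothetical)."""

    def __init__(self, write_block_bytes, flush):
        self.write_block_bytes = write_block_bytes
        self.flush = flush        # callback that writes one full block to media
        self.pending = []         # staged (lba, data) pairs in arrival order
        self.pending_bytes = 0
        self.mapping = {}         # lba -> (segment number, offset in segment)
        self.segment = 0

    def write(self, lba, data):
        """Stage a write and acknowledge it; touch media only when full."""
        self.mapping[lba] = (self.segment, self.pending_bytes)
        self.pending.append((lba, data))
        self.pending_bytes += len(data)
        if self.pending_bytes >= self.write_block_bytes:
            self._eject()
        return True               # acknowledged as soon as the data is staged

    def _eject(self):
        """Send the whole buffer to media as one large sequential write."""
        self.flush(self.segment, b"".join(data for _, data in self.pending))
        self.pending.clear()
        self.pending_bytes = 0
        self.segment += 1


flushed = []
layer = LogStructuredBuffer(
    write_block_bytes=16 * 4096,
    flush=lambda segment, blob: flushed.append((segment, len(blob))),
)
for i in range(32):
    # Scattered logical addresses stand in for small random writes.
    layer.write(lba=(i * 97) % 1000, data=bytes(4096))

print(flushed)  # prints [(0, 65536), (1, 65536)]
```

Thirty-two scattered 4K writes reach the media as just two full-block writes, and the mapping table records where each logical address ended up in the log.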

These two operations increase both random write performance and endurance. Here are some specifics. WhipTail’s endurance has been rated at over seven years on a standard MLC-based drive that has 5,000 PE cycles.

This is based on an overwrite ratio of roughly 3x per day. So WhipTail has basically taken write amplification all the way down from what most manufacturers advertise, roughly 10x write amp, to darn near 1:1 write amplification. As for random write performance, we tend to get 250,000 random 4K IOPS per appliance.
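The effect of write amplification on endurance can be illustrated with a simple model in Python. The formula is an assumption made for this sketch (each full overwrite costs one PE cycle per cell, multiplied by the write amplification factor), and `endurance_years` is a hypothetical helper. The absolute figure depends on inputs the interview does not spell out, such as overprovisioning and how the overwrite ratio is measured, so treat the output as illustrative rather than as a reproduction of the rated lifetime.

```python
def endurance_years(pe_cycles, overwrites_per_day, write_amplification):
    """Years until the PE budget is spent, under a simple linear wear model."""
    erases_per_day = overwrites_per_day * write_amplification
    return pe_cycles / erases_per_day / 365.0


# Inputs quoted in the interview: 5,000 PE cycles, roughly 3 overwrites a day.
at_10x = endurance_years(5000, 3, 10.0)   # write amplification of about 10x
at_1x = endurance_years(5000, 3, 1.0)     # write amplification near 1:1

print(f"{at_1x / at_10x:.0f}x longer life at 1:1")  # prints "10x longer life at 1:1"
```

Whatever the exact inputs, lifetime in this model scales inversely with write amplification, so going from 10x to 1:1 stretches the same PE budget ten times further.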

What is also important is that WhipTail always writes at a stripe boundary, which means it never has to do a read in order to do a write, as can be required with RAID 5 or RAID 6. This means that the SSDs remain extremely happy and the performance stays up because it does not IO amplify.
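Why a full-stripe write avoids reads can be shown with single-parity XOR in Python. This is a generic RAID 5 style sketch, not WhipTail's RAID code; the function names are hypothetical, and the second parity of RAID 6 is not modeled.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_write(new_chunks):
    """All data chunks are in hand, so parity needs no reads from media."""
    parity = reduce(xor_bytes, new_chunks)
    reads = 0
    return parity, reads


def partial_stripe_write(old_chunk, old_parity, new_chunk):
    """Small write: the old data and old parity must be read back first."""
    parity = xor_bytes(xor_bytes(old_parity, old_chunk), new_chunk)
    reads = 2
    return parity, reads


chunks = [bytes([1]) * 8, bytes([2]) * 8, bytes([4]) * 8]
parity, full_reads = full_stripe_write(chunks)

# Update only the middle chunk the read-modify-write way.
new_chunk = bytes([8]) * 8
new_parity, partial_reads = partial_stripe_write(chunks[1], parity, new_chunk)

print(full_reads, partial_reads)  # prints "0 2"
```

Both paths arrive at the same parity, but the partial-stripe update pays for two extra reads on every small write, which is the IO amplification that writing on stripe boundaries sidesteps.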

In the next blog entry in this interview series with Candelaria, he will explain how WhipTail optimizes SSD performance while minimizing the deficiencies of MLC flash.

In the third blog entry in this series, I discuss with Candelaria how WhipTail deals with variances between each SSD manufacturer’s hardware and firmware.

In the fourth installment in this series, Candelaria explains how and why WhipTail uses software RAID in its SSD appliance.

In Part V of this series, James and I discuss the hardware and software supported by WhipTail and why FCoE and iSCSI trump InfiniBand in today’s SSD deployments.

Ben Maas

About Ben Maas

Senior Analyst for DCIG. Linux Kool-Aid Drinker. Twins Groupie. Fascinated by anything with silicon wafers.


  • Andrew Vogan says:

    I find this article to be confusing. The overwrite ratio of 3x per day is a write amplification and 3x per day is not that good in a client space. Writing in stripes is good. The article says that WhipTail never has to do a read to perform a write. What if the host workload is full range random 512 byte writes? Does the indirection table contain enough memory to manage worst case granularity maximum element count? If not then some form of GC would be involved.
    250,000 iop/sec is very good, however, I expect that to be a deep queued number which is relevant in enterprise products but not in client where the average queue depth is between 1 and 2. I would be interested to know the random 512 byte, full range, performance at Qd=1.
    Andrew Vogan

  • Ben Maas says:

    Good questions Andrew.
    In this context James is talking about drive endurance. If a cell has a lifespan of 5,000 PE cycles and it gets overwritten an average of 3x a day (a high number for most cells) the drive will last about seven years. Whiptail is attempting to make sure that if a cell needs to be overwritten 3x a day it is overwritten exactly 3x (1:1) not 10x (a 3:10 write amplification).
    James and I did circle back later in the interview and talk about just the problem you bring up. James explains the reasons behind this and then how Whiptail is solving that problem later in the interview and it will be in the third installment in this series.

  • Thanks Ben. Andrew, what Ben explains is correct: by mitigating write amplification we extend the useful life of standard MLC media dramatically.
    With regards to 512-byte random writes, our block translation layer does issue fill-in READ requests to fill in any blocks that fall short of a full 4K block. This is necessary because, as you surmised, keeping an LBA->PBA translation with 512-byte granularity would require a substantially larger RAM footprint.
    With regards to the performance question, 250,000 IOPS is achieved with a QD of 40; however, at a QD of one we can still achieve slightly north of 200,000 IOPS @ 4K. @ 512b single threaded you are looking at 104,000 IOPS regardless of queue depth. This is due to the read fill-in as described above.
    I hope this answers your questions.

  • Ernest Griffin says:

    I am currently using the Whiptail array. I can say that GC has been our bane of existence on this hardware. We have allocated 25% of the array as a “buffer” (1.1TB) and it is still needing more. We set the GC at 30% and we are getting 25,000ms response times on both reads and writes. I would expect that Whiptail thinks they have found a diamond on paper, but they produced a lump of coal.
