Monday, February 09, 2015

What is TRIM and why do SSDs need it?

Note: This was originally written as an email explanation. I didn't bother to clean it up much before posting it.

This whole SSD TRIM issue comes down to differences between SSDs and HDDs, and to attempts to remain insanely backwards-compatible.

Since I'm writing this for everyone, please don't get upset if I'm telling you something you already know. Let's go back to the late 1980s!

PC IDE hard drives originally used a type of addressing system called "CHS". This stood for "Cylinders, heads, sectors". This described the physical location of data on the disk.

In detail:

  • Cylinders were the individual tracks on each disk. Unlike a CD or phonograph record, where there is a single spiral track going from the center to the edge, hard drives have closed circular loops, similar to the layers of an onion when viewed from a flat cross-section. Since each platter of a hard drive has the same general layout, e.g. Track #1 is directly above and/or below Track #1 on all the other platters, it was best to describe "Track #so-and-so on any side" as a cylinder.
  • Heads. Each hard drive platter has its own set of read-write heads, generally two per platter (one per side), although a few odd drives only used one side of each platter. This value told the drive which platter and side to look for data on.
  • Sectors. The smallest unit of data on a hard drive is a sector. Sectors can come in a number of sizes, from 128 bytes per sector up to 16K per sector. The most common value is 512 bytes per sector, as that's what PC floppy disks used.



However, this is a complicated way of accessing data on a drive, so it was later replaced by LBA - Logical Block Addressing. LBA replaced the former three-dimensional addressing system with a one-dimensional block address - basically all of the drive's sectors are now just one very long tape, from the perspective of the BIOS and OS, and the physical layout of the sectors on disk no longer matters. The big thing, though, is that hard drives still used 512-byte sectors, or at least claimed to, using some internal trickery that's completely invisible to the outside world.
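To make the "one very long tape" idea concrete, here is a minimal Python sketch of the conventional CHS-to-LBA arithmetic. The function names and the example geometry (16 heads, 63 sectors per track) are illustrative assumptions, not something any particular drive or BIOS is claimed to use. Cylinders and heads count from 0 while sectors count from 1, which is why the formula subtracts one from the sector number.

```python
def chs_to_lba(c, h, s, heads_per_cylinder, sectors_per_track):
    """Map a (cylinder, head, sector) triple to a linear block address.

    Cylinders and heads are 0-based; sectors are 1-based, as in classic CHS.
    """
    if not 0 <= h < heads_per_cylinder:
        raise ValueError("head out of range")
    if not 1 <= s <= sectors_per_track:
        raise ValueError("sector out of range")
    return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: recover (cylinder, head, sector) from a block address."""
    c, rem = divmod(lba, heads_per_cylinder * sectors_per_track)
    h, s0 = divmod(rem, sectors_per_track)
    return c, h, s0 + 1  # convert the 0-based sector index back to 1-based
```

With the assumed geometry, the very first sector (0, 0, 1) becomes block 0, the first sector under the next head (0, 1, 1) becomes block 63, and the first sector of the next cylinder (1, 0, 1) becomes block 1008.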

Modern SSDs (and many modern HDDs) will report their sector size as 512 bytes, for the sake of backwards compatibility (there are some BIOSes as late as 2008 that will fail to recognize any hard drive or SSD that reports its sector size as anything other than 512 bytes). However, their actual physical sector size is generally 4096 bytes.

Standard EEPROMs erase their data one machine word at a time (8-32 bits, depending on the data bus width). Flash EEPROM, such as that used in SSDs, is erased in "pages", which are generally 4096 bytes. This means that to change a single 512-byte sector stored on an SSD with flash memory, the SSD's controller reads the entire 4096-byte page, swaps out a single 512-byte chunk with the new sector data from the OS, erases the entire page (as you can't erase anything smaller than this on flash memory), and then writes back 512 changed bytes, plus 3584 bytes that were unchanged!
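The read/modify/erase/write sequence can be sketched as a toy Python model. This is only an illustration under the simplified picture used here (4096-byte pages holding eight 512-byte sectors); the class name `NaiveFlash` and its methods are made up for the example and do not correspond to any real controller firmware.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE  # 8 logical sectors per page


class NaiveFlash:
    """Toy flash device that rewrites a whole page for every sector write."""

    def __init__(self, num_pages):
        # Erased flash reads back as all 0xFF bytes.
        self.pages = [bytearray(b"\xff" * PAGE_SIZE) for _ in range(num_pages)]
        self.erase_counts = [0] * num_pages

    def write_sector(self, lba, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        page_no, slot = divmod(lba, SECTORS_PER_PAGE)
        # 1. Read the entire page into a buffer.
        buf = bytearray(self.pages[page_no])
        # 2. Swap the one changed sector into the buffer.
        buf[slot * SECTOR_SIZE:(slot + 1) * SECTOR_SIZE] = data
        # 3. Erase the whole page and count the erase cycle.
        self.pages[page_no] = bytearray(b"\xff" * PAGE_SIZE)
        self.erase_counts[page_no] += 1
        # 4. Write back all 4096 bytes, 3584 of which did not change.
        self.pages[page_no][:] = buf

    def read_sector(self, lba):
        page_no, slot = divmod(lba, SECTORS_PER_PAGE)
        start = slot * SECTOR_SIZE
        return bytes(self.pages[page_no][start:start + SECTOR_SIZE])
```

In this model, filling one page sector by sector costs eight erase cycles on that page, which is exactly the kind of wear a real drive wants to avoid.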

And that leads us to another issue: Flash memory cells can only sustain so many erasure cycles. They eventually reach a state where the stored charge is too much for the high voltage erase pulse to erase, and the old data is permanently burned in.

So, SSDs don't just do this read/modify/erase/write, you see, they try their damnedest to NOT erase. Instead, they write the logical sectors to ANOTHER PAGE on flash and mark the "old" location as "unused" (they won't erase it right away, but will schedule it for erasure at some later time, like when the OS hasn't made a request in five minutes, or something of the sort). This process is known as "wear levelling" and is intended to extend the useful life of SSDs.
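Here is a rough Python sketch of that "write somewhere else and remember where" trick. It assumes a simple append-only write pointer and a dictionary as the logical-to-physical map; real controllers are far more elaborate, and every name here (`RemappingSSD`, `stale`, `next_free`) is hypothetical.

```python
SECTORS_PER_PAGE = 8


class RemappingSSD:
    """Toy model: every write lands in a fresh slot; old slots go stale."""

    def __init__(self, num_pages):
        self.num_slots = num_pages * SECTORS_PER_PAGE
        self.slots = [None] * self.num_slots  # physical storage, one entry per sector slot
        self.mapping = {}                     # logical sector -> physical slot
        self.stale = set()                    # slots with superseded data, awaiting erase
        self.next_free = 0                    # append-only write pointer

    def write_sector(self, lba, data):
        if self.next_free >= self.num_slots:
            raise RuntimeError("no free slots; a real drive would garbage-collect here")
        old = self.mapping.get(lba)
        if old is not None:
            self.stale.add(old)               # mark as unused, but do not erase yet
        self.slots[self.next_free] = data
        self.mapping[lba] = self.next_free
        self.next_free += 1

    def read_sector(self, lba):
        slot = self.mapping.get(lba)
        return None if slot is None else self.slots[slot]
```

Writing the same logical sector twice leaves the first copy physically in place but marked stale, and no erase happens at write time.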

Also, filesystems like FAT32 and NTFS don't store file locations on disk by sector address; they group sectors into "clusters", which are always a power-of-two count of sectors. It's a trick to help reduce filesystem fragmentation.

The further problem is that when, e.g., you write a bunch of small files to your SSD, they might each use only one or two logical sectors of an eight-sector page. But your SSD is smart: it knows that it can later combine these two "used" sectors with two "used" sectors from a "distant" part of the drive, or some other combination of the sort, into a single flash page, if doing so helps it avoid erasing any one flash memory page too many times.

And now where TRIM becomes /sorely/ needed: Because file deletion doesn't do anything to the actual disk locations where the file data was stored, and simply removes the file's information from the filesystem's "table of contents", the SSD doesn't know that the data was actually deleted. It doesn't know that it can stop copying that data around each time it has to remap the data on the flash memory. Sometimes it'll try to combine two pages whose combined "used" sector count is higher than 8. The leftover sectors are then merged with another page, and if there are still leftovers, it keeps merging until there aren't any more leftover "sectors". But this means that the flash memory is being written more than once in order to complete a write of what the OS thought was just one single sector. This problem is known as write amplification; it is a major cause of the poor performance of older SSDs, and it also quickly reduces their useful lives.
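A back-of-the-envelope Python sketch shows how the numbers blow up. It assumes the simplified eight-sectors-per-page picture and treats write amplification as "sectors physically written divided by sectors the OS asked to write"; the helper names `consolidate` and `write_amplification` are invented for this example.

```python
import math

SECTORS_PER_PAGE = 8


def consolidate(used_counts):
    """Pack the "used" sectors of several partly filled pages into as few pages as possible.

    used_counts holds, per page, how many sectors the drive believes are in use.
    Returns (sectors_copied, pages_after).
    """
    if any(not 0 <= n <= SECTORS_PER_PAGE for n in used_counts):
        raise ValueError("each page holds between 0 and 8 used sectors")
    sectors_copied = sum(used_counts)
    pages_after = math.ceil(sectors_copied / SECTORS_PER_PAGE)
    return sectors_copied, pages_after


def write_amplification(host_sectors, extra_sectors_copied):
    """Sectors physically written per sector the OS asked to write."""
    if host_sectors <= 0:
        raise ValueError("host_sectors must be positive")
    return (host_sectors + extra_sectors_copied) / host_sectors
```

Suppose three pages appear to hold 5, 6 and 3 used sectors. Consolidating them copies 14 sectors into 2 pages, so a single one-sector write that triggers this costs 15 sector writes. If 9 of those 14 sectors actually belong to deleted files and the drive knew it, the counts would be 2, 2 and 1: only 5 sectors copied into 1 page, for a factor of 6 instead of 15.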

But TRIM solves this. TRIM provides an industry-standard ATA protocol command for telling an SSD that a logical data sector has indeed been deleted by the user or OS, and that the data no longer needs to be meticulously copied and maintained (and can safely be erased and treated as empty).
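The effect can be illustrated with a small Python toy. This is a sketch under heavy assumptions: `ToySSD`, `delete_file` and the idea that garbage collection copies every sector the drive believes is live are all made up for illustration, and returning `None` for a trimmed sector is just a modelling choice (what a real drive returns after TRIM depends on the drive).

```python
class ToySSD:
    """Toy drive that tracks which logical sectors it believes hold live data."""

    def __init__(self):
        self.mapping = {}  # logical sector -> data the drive believes is live
        self.copies = 0    # total sectors rewritten by garbage collection

    def write_sector(self, lba, data):
        self.mapping[lba] = data

    def trim(self, lba):
        """Host tells the drive this logical sector no longer holds useful data."""
        self.mapping.pop(lba, None)

    def read_sector(self, lba):
        return self.mapping.get(lba)

    def garbage_collect(self):
        """Relocate every sector still believed live; return how many were copied."""
        copied = len(self.mapping)
        self.copies += copied
        return copied


def delete_file(ssd, file_sectors, use_trim):
    # Deleting only edits the filesystem's table of contents (not modelled here),
    # so the drive learns nothing unless a TRIM is sent for the freed sectors.
    if use_trim:
        for lba in file_sectors:
            ssd.trim(lba)
```

Write 16 sectors to two such drives, delete a 12-sector file on each, and run garbage collection: the drive that never received TRIM copies all 16 sectors, while the one that did copies only the 4 that are still in use.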