A guide to understanding the hidden algorithms that manage the data in our everyday world, from smartphones to cloud apps. We look at which ones perform faster—and why.
Block translation layers handle much of our data. These algorithms are hidden away inside our SSDs, the storage in our phones, or the systems that store Dropbox files from a few years ago that you’ve probably forgotten about. They transform difficult-to-use devices like NAND flash or shingled disks into well-behaved ones, supporting the rewritable block interface that our file systems know and love. But these translation layers are good for something besides making weird chips and disks safe for file systems. We’re using them in the cloud, creating virtual disks on top of S3 object storage. To explain why, we’ll take a detour through translation layers, S3 object storage, and disk consistency models first.
A block translation layer provides a simple rewritable block interface over something else that doesn’t. The most widely known examples are Flash Translation Layers (FTLs), used almost everywhere flash memory is used. NAND flash itself is difficult to use: although it’s divided into pages about the size of a disk block, and the pages can be read independently, the similarities end when we get to writes. Hundreds of these pages are grouped into erase units; the pages in a unit must be written one at a time, in order, and once the unit is full it cannot be rewritten until all of its pages are erased in a single operation. A flash translation layer accepts disk-like block read and write requests, and uses out-of-place writes, a dynamic translation map, and garbage collection to implement them on top of flash. Since it can’t overwrite existing data, each write goes to a new location and updates a logical-to-physical map used for reads. The old location is now invalid—i.e., garbage—and when enough garbage accumulates, the remaining data is copied out of an erase unit so that it can be erased and made available for new writes.
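The core of this scheme is small enough to sketch in a few lines. Here is a toy FTL in Go (the language of our prototype daemon) showing just the out-of-place write and the translation map; page and erase-unit geometry, and garbage collection itself, are omitted, and all names here are illustrative:

```go
package main

import "fmt"

// ftl is a toy flash translation layer: every write goes to the next
// free physical page (out-of-place), and a map records where each
// logical block currently lives.
type ftl struct {
	next    int          // next free physical page
	mapping map[int]int  // logical block -> physical page
	valid   map[int]bool // physical page -> still holds live data?
}

func newFTL() *ftl {
	return &ftl{mapping: map[int]int{}, valid: map[int]bool{}}
}

// write places a logical block at the next free page and invalidates
// any old copy, turning the old page into garbage for later collection.
func (f *ftl) write(lba int) {
	if old, ok := f.mapping[lba]; ok {
		f.valid[old] = false // old location is now garbage
	}
	f.mapping[lba] = f.next
	f.valid[f.next] = true
	f.next++
}

// read just consults the translation map.
func (f *ftl) read(lba int) (int, bool) {
	p, ok := f.mapping[lba]
	return p, ok
}

func main() {
	f := newFTL()
	f.write(7) // lba 7 -> page 0
	f.write(7) // rewrite: lba 7 -> page 1, page 0 becomes garbage
	p, _ := f.read(7)
	fmt.Println(p, f.valid[0]) // 1 false
}
```

A real FTL also tracks how much garbage each erase unit holds, so the collector can pick the emptiest unit to clean; that bookkeeping is left out here.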
But why is S3 storage like NAND flash, and why does it need a translation layer? At first glance S3 looks like a file system: variable-length objects have names and hold data, and reads are allowed with arbitrary byte offsets and lengths. Like flash, however, writes are different. S3 objects must be written in a single operation,1 which either creates a new object or replaces an existing one. Once you’ve written an object you can read it or delete it, but the only way to modify it is to rewrite the entire thing. Why is S3 so limited? In a nutshell, because it’s easier that way,2 owing to issues of replication and consistency.
Consistency vs. Performance
Cloud storage systems like AWS S3 or Ceph replicate data on multiple machines for reliability, and they are designed to transparently handle failure of any of these machines without affecting the user. One of the hardest parts of this failure handling is keeping these replicas consistent with each other. For instance, if replica A is down when I make a change to replicas B and C, but then comes back up, I risk getting different data on each read depending on which replica it is routed to. Avoiding this requires mechanisms like locks and write-ahead logs, which add complexity and subtract performance, and it gets even worse when you go from simple replication to erasure coding. In contrast, when you create a new write-once object, consistency is simple: if you have a copy of it, then you know it’s the right one. Overwrites are a bit trickier, but not by much, mostly because S3 makes very few promises about when you’ll actually see the new copy.
In the open source world, Ceph is the most widely used cloud-scale storage system, and it supports both write-once and rewritable abstractions. At the lowest layer it has a pool of Object Storage Devices (OSDs) storing rewritable objects; unlike S3 objects, these really do work like files. The Ceph RADOS Gateway (RGW) provides an S3 object service over these OSDs, splitting large S3 objects into smaller fixed-size Ceph objects and writing several of them in parallel for higher throughput. Although OSDs provide mechanisms to modify these objects safely, RGW never uses them, and so never pays the performance price of write-ahead logging and other mechanisms for preserving consistency.
The Ceph virtual disk, RADOS Block Device (RBD), takes advantage of Ceph’s rewritable objects by splitting a virtual disk image into smaller fixed-size Ceph objects, and translating disk block reads and writes into reads and writes of the corresponding object byte ranges. In contrast, creating a virtual disk over S3 requires something that looks a lot like a flash translation layer: new writes go to new S3 objects and update a translation map, and any remaining live data in old S3 objects is garbage collected before the object is deleted.
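The S3-backed variant can be sketched in a few lines of Go. This is not our prototype’s actual code; it is a minimal sketch, assuming 4 KiB blocks and a hypothetical `volume/NNNNNNNN` key naming scheme, showing how a batch of dirty blocks becomes one new numbered object plus translation-map updates:

```go
package main

import "fmt"

// extent records where the current copy of a logical block lives:
// which numbered object, and at what byte offset inside it.
type extent struct {
	seq    int // object sequence number (embedded in the S3 key)
	offset int // byte offset within that object
}

// s3Disk is a toy translation layer mapping disk blocks to S3 objects.
type s3Disk struct {
	seq     int            // last sequence number used
	mapping map[int]extent // logical block -> current location
}

// writeBatch assigns a batch of dirty blocks to a single new object,
// named by an ever-increasing sequence number, and updates the map.
// (A real implementation would also PUT the block data to this key,
// and garbage-collect old objects whose blocks are all superseded.)
func (d *s3Disk) writeBatch(lbas []int) string {
	d.seq++
	key := fmt.Sprintf("volume/%08d", d.seq)
	for i, lba := range lbas {
		d.mapping[lba] = extent{seq: d.seq, offset: i * 4096}
	}
	return key
}

func main() {
	d := &s3Disk{mapping: map[int]extent{}}
	fmt.Println(d.writeBatch([]int{3, 4, 9})) // volume/00000001
	fmt.Println(d.mapping[9].offset)          // 8192
}
```

Note the structural similarity to the FTL: new data never overwrites an existing object, so every object is write-once, which is exactly the abstraction S3 provides cheaply.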
So why are we going to so much trouble to create a virtual disk over S3? The answer comes back to local caching, and in the end to yet more issues of consistency. As high-speed NVMe drives become more and more affordable, it becomes very tempting to use a local cache for virtual disks, as these local IOPS are far cheaper than equivalent performance in a shared storage cluster. Unfortunately if you do this wrong, the price you pay might be your file system.
Modern file systems are crash consistent: they order their writes in such a way that the file system is unlikely to be corrupted by a crash,3 typically by using a write-ahead log or journal for metadata updates such as directory entries and allocation bitmaps. To achieve this ordering, while still using asynchronous writes for high performance, file systems (and the fsync system call) use commit barriers—operations like the SCSI synchronize cache command, which guarantee that all preceding writes will be performed before any following ones.
A simple local cache with asynchronous write-back (there are three available in the kernel, and several others) can easily be combined with a virtual disk such as RBD, resulting in tremendous boosts in performance. Everything will be fine as long as neither the local SSD nor the virtualization host itself fails permanently, but if they do, things get messy. When this happens, all that’s left is the remote virtual disk image.4 The bcache documentation describes the likely result for a cache that ignores commit barriers: “you will have massive filesystem corruption, though ext4’s fsck does work miracles.”
One alternative is to use a write-through cache, but that sacrifices much of the speed advantage of a local cache. The other alternative is to use a cache that preserves commit barriers, several of which are described in the literature. This second approach works great for workloads with few commit barriers; some that we’ve measured in the lab have one or two barriers per gigabyte written, and would be unaffected. Other workloads (SQLite is a prime example) send commit barriers after every few writes to the disk, and show little improvement over simple write-through.
By using a translation layer we sidestep this problem entirely. After logging writes to local SSD for durability in the case of recoverable crashes, our system batches a sequence of writes into a single object, gives it a sequence number (embedded in the object name), and writes it to the back end. If multiple object writes are outstanding when the system crashes, it’s possible that a random subset of them will fail to complete. However unlike RBD (or iSCSI, QCOW2 over NFS, etc.), we still have the old data, and we can decide which updates to apply and which to discard. In particular, we examine the sequence numbers and find the first gap: all updates before this gap are applied to the volume, and any objects following that sequence number gap are discarded. This preserves commit barrier semantics: if we keep a write following a commit barrier, then any write before that barrier is either in the same object, or in a preceding one that is guaranteed to exist by our recovery rule.
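The recovery rule itself is a few lines of code. Here is a Go sketch, assuming sequence numbers start at 1 and increase by one per object; the function name is our own, not part of any API:

```go
package main

import (
	"fmt"
	"sort"
)

// recoverable applies the first-gap rule described above: sort the
// sequence numbers of the objects that survived the crash, keep every
// object up to the first gap, and discard everything after it.
func recoverable(seqs []int) []int {
	sort.Ints(seqs)
	keep := []int{}
	for i, s := range seqs {
		if (i == 0 && s != 1) || (i > 0 && s != seqs[i-1]+1) {
			break // first gap: all later objects are discarded
		}
		keep = append(keep, s)
	}
	return keep
}

func main() {
	// Objects 5 and 6 were lost in the crash but 7 completed:
	// only 1 through 4 are applied, and 7 is discarded.
	fmt.Println(recoverable([]int{1, 2, 3, 4, 7})) // [1 2 3 4]
}
```

Discarding object 7 in this example is what preserves barrier semantics: any barrier that 7’s writes depended on lives in some object numbered 6 or below, and we cannot prove those writes survived.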
We’ve implemented a prototype of a translation layer over S3, split into a kernel device mapper and a Golang-based user-level daemon, and are testing it extensively. We hope to deploy a version of this as a pilot storage pool in the Mass Open Cloud (massopen.cloud) later this year.
1Or a multi-part upload, but the result is the same.
2In engineering, “easier” often means “cheaper”, “more reliable”, or even just “possible.”
3Having used Linux since kernel 1.0 and the ext2 file system, I can attest that this was not always the case, much to my occasional distress.
4Note that most of these caches were designed to cache local hard drives, where this scenario was unlikely.