Red Hat Research Quarterly

Translation layers for the cloud: speeding storage performance

Peter Desnoyers

Peter Desnoyers is an Associate Professor in the Khoury College of Computer Sciences, which he joined in 2008. He is one of the founders of the Mass Open Cloud, a multi-institutional collaboration to develop new models for cloud computing, and serves on the steering committee. His research is focused on storage issues in operating systems, in particular the integration of emerging storage technologies such as flash and SMR disk into existing software infrastructures.

Related Projects

Mass Open Cloud (MOC): An open, distributed platform enabling AI/ML workloads

Article featured in

Red Hat Research Quarterly

May 2021

Download PDF

Subscribe now

In this issue

From the Director

Expanding the impact of open source

Hugh Brock

News

Telemetry Working Group looks at observability

Gordon Haff

News

Big data, security certification, and FPGAs: 2021 Red Hat Research Days have begun

Gordon Haff

News

Boston University and Red Hat partner to support the New England Research Cloud

Column

Planting the seeds for a blossoming research program

Matej Hrušovský

Interview

When one plus one makes more than two: how open source builds a bridge between universities and industry

Idan Levi

Feature

Demystifying real-time Linux scheduling latency

Daniel Bristot de Oliveira

Feature

Verifying programs that communicate with the environment

Henrich Lauko

Feature

Translation layers for the cloud: speeding storage performance

Peter Desnoyers

Feature

Combining experience with passion inspires a new mentorship program

Irit Goihman

Liora Milbaum

A guide to understanding the hidden algorithms that manage the data in our everyday world, from smartphones to cloud apps. We look at which ones perform faster—and why.

Block translation layers handle much of our data. These algorithms are hidden away inside our SSDs, the storage in our phones, or the systems that store Dropbox files from a few years ago that you’ve probably forgotten about. They transform difficult-to-use devices like NAND flash or shingled disks into well-behaved ones, supporting the rewritable block interface that our file systems know and love. But these translation layers are good for something besides making weird chips and disks safe for file systems. We’re using them in the cloud, creating virtual disks on top of S3 object storage. To explain why, we’ll take a detour through translation layers, S3 object storage, and disk consistency models first.

We’re using them in the cloud, creating virtual disks on top of S3 object storage.

A block translation layer provides a simple rewritable block interface over something else that doesn’t. The most widely known examples are Flash Translation Layers (FTLs), used almost everywhere flash memory is used. NAND flash itself is difficult to use: although it’s divided into pages about the size of a disk block, and the pages can be read independently, the similarities end when we get to writes. Hundreds of these pages are grouped into erase units, and the pages in a unit must be written one at a time, in order, until it’s full and can’t be rewritten until all of them are erased in a single operation. A flash translation layer accepts disk-like block read and write requests, and uses out-of-place writes, a dynamic translation map, and garbage collection to implement them on top of flash. Since it can’t overwrite existing data, each write goes to a new location and updates a logical-to-physical map used for reads. The old location is now invalid—i.e., garbage—and when enough garbage accumulates, the remaining data is copied out of an erase unit so that it can be erased and made available for new writes.

But why is S3 storage like NAND flash, and why does it need a translation layer? At first glance S3 looks like a file system: variable-length objects have names and hold data, and reads are allowed with arbitrary byte offsets and lengths. Like flash, however, writes are different. S3 objects must be written in a single operation,¹ which either creates a new object or replaces an existing one. Once you’ve written an object you can read it or delete it, but the only way to modify it is to rewrite the entire thing. Why is S3 so limited? In a nutshell, because it’s easier that way,² because of issues of replication and consistency.

So why are we going to so much trouble to create a virtual disk over S3? The answer comes back to local caching, and in the end to yet more issues of consistency.

Consistency vs. Performance

Cloud storage systems like AWS S3 or Ceph replicate data on multiple machines for reliability, and they are designed to transparently handle failure of any of these machines without affecting the user. One of the hardest parts of this failure handling is keeping these replicas consistent with each other. For instance, if replica A is down when I make a change to replicas B and C, but then comes back up, I risk getting different data on each read depending on which replica it is routed to. Avoiding this requires mechanisms like locks and write-ahead logs, which add complexity and subtract performance, and it gets even worse when you go from simple replication to erasure coding. In contrast, when you create a new write-once object, consistency is simple: if you have a copy of it, then you know it’s the right one. Overwrites are a bit trickier, but not by much, mostly because S3 makes very few promises about when you’ll actually see the new copy.

In the open source world, Ceph is the most widely used cloud-scale storage system, and it supports both write-once and rewritable abstractions. At the lowest layer it has a pool of Object Storage Devices (OSD) storing rewritable objects; unlike S3 objects, these really do work like files. The Ceph RADOS Gateway (RGW) provides an S3 object service over these OSDs, splitting large S3 objects into smaller fixed-sized Ceph objects, writing multiple smaller objects in parallel for higher throughput. Although OSDs provide mechanisms to modify these objects safely, RGW never uses them, and so never pays the performance price of write-ahead logging and other mechanisms for preserving consistency.

The Ceph virtual disk, RADOS Block Device (RBD), takes advantage of Ceph’s rewritable objects by splitting a virtual disk image into smaller fixed-size Ceph objects, and translating disk block reads and writes into reads and writes of the corresponding object byte ranges. In contrast, creating a virtual disk over S3 requires something that looks a lot like a flash translation layer: new writes go to new S3 objects and update a translation map, and any remaining live data in old S3 objects is garbage collected before the object is deleted.

So why are we going to so much trouble to create a virtual disk over S3? The answer comes back to local caching, and in the end to yet more issues of consistency. As high-speed NVMe drives become more and more affordable, it becomes very tempting to use a local cache for virtual disks, as these local IOPS are far cheaper than equivalent performance in a shared storage cluster. Unfortunately if you do this wrong, the price you pay might be your file system.

Modern file systems are crash consistent: they order their writes in such a way that the file system is unlikely to be corrupted by a crash,³ typically by using a write-ahead log or journal for metadata updates such as directory entries and allocation bitmaps. To achieve this ordering, while still using asynchronous writes for high performance, file systems (and the fsync system call) use commit barriers—operations like the SCSI synchronize cache command, which guarantee that all preceding writes will be performed before any following ones.

A simple local cache with asynchronous write-back (there are three available in the kernel, and several other ones) can easily be combined with a virtual disk such as RBD, resulting in tremendous boosts in performance. Everything will be fine as long as neither the local SSD nor the virtualization host itself fail permanently, but if they do, things get messy. When this happensall that’s left is the remote virtual disk image.⁴ The bcache documentation describes the likely result for a cache that ignores commit barriers: “you will have massive filesystem corruption, though ext4’s fsck does work miracles.”

One alternative is to use a write-through cache, but that sacrifices much of the speed advantage of a local cache. The other alternative is to use a cache that preserves commit barriers, several of which are described in the literature. This second approach works great for workloads with few commit barriers; some that we’ve measured in the lab have one or two barriers per gigabyte written, and would be unaffected. Other workloads (SQLite is a prime example) send commit barriers writes after every few writes to the disk, and show little improvement over simple write-through.

By using a translation layer we sidestep this problem entirely. After logging writes to local SSD for durability in the case of recoverable crashes, our system batches a sequence of writes into a single object, gives it a sequence number (embedded in the object name), and writes it to the back end. If multiple object writes are outstanding when the system crashes, it’s possible that a random subset of them will fail to complete. However unlike RBD (or iSCSI, QCOW2 over NFS, etc.), we still have the old data, and we can decide which updates to apply and which to discard. In particular, we examine the sequence numbers and find the first gap: all updates before this gap are applied to the volume, and any objects following that sequence number gap are discarded. This preserves commit barrier semantics: if we keep a write following a commit barrier, then any write before that barrier is either in the same object, or in a preceding one that is guaranteed to exist by our recovery rule.

We’ve implemented a prototype of a translation layer over S3, split into a kernel device mapper and a Golang-based user-level daemon, and are testing it extensively. We hope to deploy a version of this as a pilot storage pool in the Mass Open Cloud (massopen.cloud) later this year.

¹Or a multi-part upload, but the result is the same.
²In engineering, “easier” often means “cheaper”, “more reliable”, or even just “possible.”
³Having used Linux since kernel 1.0 and the ext2 file system, I can attest that this was not always the case, much to my occasional distress.
⁴Note that most of these caches were designed to cache local hard drives, where this scenario was unlikely.

SHARE THIS ARTICLE

Smarter AI, fewer resources: bringing cloud AI into real-time edge devices to unlock performance

Eshed Ohn-Bar

A new AI framework for edge systems overcomes the communication and energy obstacles that limit their use in real-time applications by integrating local and cloud decision-making while maintaining strong performance. Artificial intelligence (AI) models with vast and generalized knowledge are increasingly being integrated into everyday devices, from smartphones that provide personalized assistance to mobile robots […]

Interview

From Brno to Waco: On cross-cultural exchange, microservice evolution, and quality assurance

Matej Hrušovský

Pavel Tišnovský

RHRQ asked Brno research manager Matej Hrušovský and Red Hat quality assurance engineer Pavel Tišnovský to talk with long-time collaborator Tomáš Černý, a native of the Czech Republic now teaching at Baylor University in Waco, Texas. Prof. Černý was in Brno recently as part of his highly successful student research initiative, which brings Baylor students […]

Interview

Open source cybersecurity and the next generation of computer scientists

Mike Bursell

Václav Matyáš, Professor with the Centre for Research on Cryptography and Security at the Faculty of Informatics at Masaryk University.

Feature

Where will we find the data scientists?

Jennifer Wood

Universities play a primary role in developing data skills, but traditional education alone can’t close the skills gap fast enough. The mismatch between the widespread need for strong data skills and the current workforce is an obstacle for nearly every sector of the economy, which means no single sector can solve it. Collaborative partnerships among […]

Feature

Testing critical IoT systems to mitigate network disruptions

Miroslav Bureš

The Internet of Things brings new opportunities and new challenges for mission-critical applications where lives are at stake. Systematic testing can help. The Internet of Things (IoT) has significantly increased the capabilities of mission-critical systems in many domains. Integrated rescue systems, healthcare, defense, energy, and transportation benefit from using the IoT, enabling faster system reactions […]

Feature

Yuga: A tool to help Rust developers write unsafe code more safely

Sanjay Arora

Baishakhi Ray

Vikram Nitin

Some bugs in unsafe Rust arise from errors that are so easy to make that they are easily overlooked. Researchers have developed a new analyzer to find them. By Vikram Nitin, Anne Mulhern, Baishakhi Ray, and Sanjay Arora Rust, a programming language that did not exist just 10 years ago, is now well known and […]

Interview

Clouds on the horizon: shared cloud computing resources make research more accessible and more powerful

Heidi Dempsey

Dr. Michael Zink is Professor of Electrical and Computer Engineering at the University of Massachusetts, Amherst. In addition to publishing and teaching, Dr. Zink has participated in several projects providing distributed systems and virtual networks for research and education, including GENI and ExoGENI (2007-2021), Cloud Lab (2014-2021), and now the Open Cloud Testbed (OCT) since […]

Perspectives

Research perspectives: Focus on security, privacy, and cryptography

Lily Sturmann

RHRQ asked Lily Sturmann, a senior software engineer at Red Hat in the Office of the CTO in Emerging Technologies, to look back at the past few years of research in the area of security and privacy research and share her perspective on the future. She has contributed frequently to the Red Hat Next blog, […]

Column

Shared knowledge or private IP? That is the question

RHRQ interviewed Idan Levi, the Research Interest Group leader in Israel, to get his take on how university research intersects with the open source approach, from datasets and collaboration to security and data privacy.