Red Hat Research Quarterly

Faster hardware through software

about the author

Gordon Haff

Gordon Haff is a Technology Advocate at Red Hat, where he works on emerging technology product strategy, writes about tech trends and their business impact, and is a frequent speaker at customer and industry events. His books include How Open Source Ate Software, and his podcast, in which he interviews industry experts, is Innovate @ Open.

Article featured in

Red Hat Research Quarterly

August 2021

In this issue

Researchers have tested several techniques for using software to get the most out of hardware. Find out about three promising projects that indicate the direction of this quickly changing field. 

It used to be simple to make computer workloads run faster. Wait eighteen months or so for more transistors consuming the same amount of power, maybe some tweaks in the fabrication material, some thoughtful balancing of frequency, core counts, interconnects, and cache sizes, and you were good to go. I’m trivializing a lot of difficult engineering work on both the hardware and software side to be sure. But the essential point is that the processor and other components largely got faster in a way that was mostly transparent to software.

With the slowdown of increasing transistor density provided by Moore’s Law and the end of the static power density offered by Dennard Scaling, this free (or at least relatively cheap) lunch is ending. Software needs to take advantage of any useful clues the hardware provides to help it run more efficiently, and the hardware itself needs to more intelligently manage its transistor budget.

The horizontal stack described by former Intel CEO Andy Grove, in which the processor and operating system were largely isolated from each other, is giving way to a world in which hardware and software are increasingly co-designed. In this article, I take a look at ongoing research that starts to break down the hard abstractions between software and hardware.

Breaking down abstractions

Can adding a single mechanism to Linux® enable huge performance gains? The Symbiote project led by Tommy Unger, a PhD student at Boston University (BU), enables an otherwise entirely standard Linux process to request elevation, that is, to run with full supervisor privileges. Elevation removes the hardware-enforced boundary between application and operating system code, allowing for aggressive co-optimization where desired. Importantly, the optimization methodology can be incremental. Before modification, Symbiotes are indistinguishable from Linux processes and are fully compatible with all the standard Linux interfaces. During optimization, programmers can settle on hybrids that partially utilize the Linux kernel, but also may specialize hot paths like system calls and interrupt paths. In the extreme, a Symbiote can fully take over the system paths and leave Linux behind—in this sense you might think of Linux acting as a bootloader for a library operating system.

Symbiotes completely relax the application-application and application-kernel protection domain. They can run virtualized or bare metal. Unlike unikernels and library operating systems, Symbiote does not prescribe a fundamental change to the familiar virtual address space layout of a process and the Linux operating system. A Symbiote can run alongside other standard Linux processes, and it can fork(). One guiding principle for Symbiote design is keeping changes within the application executable (as opposed to making changes to the Linux kernel) whenever possible. For example, if a programmer wants to specialize the page fault handler, they can memcpy() the Interrupt Descriptor Table (IDT) into the application address space, modify it there to point to a user-defined handler, and swing the Interrupt Descriptor Table Register to point to this new IDT. The project aims to keep changes to the Linux kernel itself to a few hundred lines of code.

Symbiotes allow for fast incremental optimization. Consider optimizing the Redis server. Running a profiler (see Figure 1) shows that significant time is spent in the write() syscall. Elevating the process to run as a Symbiote allows syscalls to be replaced with low-latency (and low processor perturbation) call instructions. Further, removing the protection boundary allows the application to call directly into kernel internals. It is easy to prototype application-specific specialization by calling progressively deeper into kernel paths. For Redis’ use of the write() syscall, for example, the programmer can call directly into ksys_write(), skipping the syscall handler. Or they could push further into new_sync_write(), skipping the virtual dispatch in the VFS layer. Making aggressive use of application-specific knowledge, they might call directly from the Redis server app into tcp_sendmsg(), skipping resolution of socket and protocol layers. This procedure allows for the rapid construction of highly optimized, one-off executables, and the effort can be partially amortized by building a shared set of optimization tools in a Symbiote library.

Figure 1. Redis server

A programmer may choose to have their Symbiote play nicely with the underlying kernel and other processes, but it can also be used to strip away (or modify) Linux policies. 

A process could take direct control of the interrupt and syscall paths (vectoring directly off a network interrupt to a handler in the app’s address space); modify network protocols; change scheduling, preemption, and power policies; modify kernel data structures; or allow for direct application control of devices. On the flip side, traditional kernel paths can be pulled into user space and extended with arbitrary code: consider that this allows for compile time optimization across the application-kernel boundary, or the replacement of a heuristic like a high-water mark with a neural network defined in a high-level programming language.

Even a lowly network card needs help

Intelligent performance tuning can even bring benefit to a device as seemingly simple as a Network Interface Card (NIC). That’s because even a “simple” NIC, such as the Intel X520 10 GbE, is complex, with hardware registers that control every aspect of the NIC’s operation, from device initialization to dynamic runtime configuration. The Intel X520 datasheet documents over 5,600 registers—far more than can be tuned using trial and error.

BU PhD student Han Dong’s project aimed to identify, using targeted benchmarks, those application characteristics that would illuminate mappings between hardware register values and their corresponding performance impact. These mappings were then used to frame the NIC configuration as a set of learning problems such that an automated system could recommend hardware settings corresponding to each network application. This, in turn, allows for new dynamic device driver policies that better attune dynamic hardware configuration to application runtime behavior. 

Dong specifically studied different operating system structures for packet processing and their implications for the energy management of network-oriented services. He conducted a detailed operating system-centric study of performance-energy trade-offs within this space. Two hardware “knobs” on modern Intel servers control these trade-offs: 1) Dynamic Voltage and Frequency Scaling (DVFS), which throttles processor frequency to save energy, and 2) the interrupt delay register on modern NICs, which controls the network interrupt firing rate. Both of these hardware controls belong to the broader family of methods that can throttle network application processing.

To gain further insight into how slowing down processing affects these trade-offs, he manually controlled these two knobs under different traffic loads for four network applications across two different operating system structures: a general purpose Linux kernel and a specialized packet processing operating system (akin to a unikernel or an application built using network acceleration libraries such as the Data Plane Development Kit, DPDK).

Among the findings: pure network polling could result in better performance and lower energy consumption than an interrupt-driven design; the entire workload finished quickly without using sleep-state idle modes. The shorter path length of a specialized operating system can also reduce processing time by improving application code instructions-per-cycle (IPC), even with application-intensive workloads. This creates additional headroom to slow down the processor and network interrupts and so further reduce energy use. The specialized stack can slow down processing more aggressively, saving up to 76 percent in energy compared with a general purpose operating system, with no noticeable impact on overall performance.
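The manual sweep over the two knobs can be pictured as a small grid search. The sketch below is illustrative only: `toy_measure` is a hypothetical stand-in for running a real benchmark, its latency and power formulas are invented for this example rather than taken from the study, and the frequency, delay, and load values are arbitrary.

```python
from itertools import product

def toy_measure(freq_ghz, itr_delay_us, load_rps):
    """Hypothetical stand-in for benchmarking one knob setting.

    Returns (latency_us, energy_joules) over a one-second window.
    The formulas are invented for illustration, not measured.
    """
    service_us = 20.0 / freq_ghz                 # slower clock, longer processing
    latency_us = service_us + itr_delay_us / 2   # average wait for a delayed interrupt
    power_w = 10.0 + 8.0 * freq_ghz ** 2         # toy model: power grows with frequency
    interrupts_per_s = min(load_rps, 1e6 / max(itr_delay_us, 1.0))
    power_w += interrupts_per_s * 2e-5           # toy per-interrupt overhead
    return latency_us, power_w * 1.0             # energy = power * 1 second

def sweep(freqs, delays, load_rps, latency_slo_us):
    """Return the lowest-energy (freq, delay, energy) that meets the latency target."""
    best = None
    for freq, delay in product(freqs, delays):
        latency, energy = toy_measure(freq, delay, load_rps)
        if latency <= latency_slo_us and (best is None or energy < best[2]):
            best = (freq, delay, energy)
    return best  # None when no setting meets the target

if __name__ == "__main__":
    print(sweep([1.2, 2.0, 2.9], [2, 50, 200], load_rps=50_000, latency_slo_us=100))
```

In this toy model, relaxing the latency target lets the search pick a longer interrupt delay at the lowest frequency, which is the kind of headroom the study describes.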

Optimizing data analytics clusters

Another performance optimization project comes out of BU, Northeastern University, and the Mass Open Cloud. The Kariz cache prefetching and management project focuses on the performance of data analytics clusters, which usually segregate data into large multi-tenant data lakes that often offer lower performance than local storage.

Like some of the other research described, Kariz takes advantage of information that the system is already surfacing. In this case, it’s I/O access information that already exists for use in scheduling and coordinating distributed workers. Spark-SQL, Hive, and Pig all collect this information in the form of a dependency Directed Acyclic Graph (DAG) identifying the inputs and outputs of each individual computation. Given this future access information (e.g., job DAGs), Kariz decides which datasets to keep available in cache, which to prefetch or evict, and when to do so.
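To make the idea of DAG-driven caching concrete, here is a minimal sketch. It is not Kariz's algorithm: it assumes a simple farthest-next-use eviction rule, whole-dataset caching, and a cache sized in dataset slots, and the names `plan_cache`, `deps`, and `inputs` are hypothetical.

```python
from graphlib import TopologicalSorter

def plan_cache(deps, inputs, capacity):
    """Plan prefetches and evictions from a job DAG.

    deps:     stage -> set of stages it depends on
    inputs:   stage -> list of datasets the stage reads
    capacity: number of datasets the cache can hold (assumed unit size)
    Returns a list of (stage, prefetched, evicted) in execution order.
    """
    order = list(TopologicalSorter(deps).static_order())

    def next_use(dataset, start):
        # Position of the next stage that reads this dataset, or infinity.
        for pos in range(start, len(order)):
            if dataset in inputs.get(order[pos], ()):
                return pos
        return float("inf")

    cache, plan = set(), []
    for pos, stage in enumerate(order):
        needed = set(inputs.get(stage, ()))
        prefetched, evicted = [], []
        for dataset in inputs.get(stage, ()):
            if dataset in cache:
                continue
            if len(cache) >= capacity:
                # Evict the dataset whose next use is farthest away,
                # never one that the current stage itself needs.
                candidates = sorted(cache - needed)
                if not candidates:
                    raise ValueError("stage needs more datasets than the cache holds")
                victim = max(candidates, key=lambda d: next_use(d, pos))
                cache.remove(victim)
                evicted.append(victim)
            cache.add(dataset)
            prefetched.append(dataset)
        plan.append((stage, prefetched, evicted))
    return plan

if __name__ == "__main__":
    deps = {"extract": set(), "join": {"extract"},
            "aggregate": {"join"}, "report": {"aggregate"}}
    inputs = {"extract": ["events"], "join": ["events", "users"],
              "aggregate": ["users", "regions"], "report": ["events"]}
    for step in plan_cache(deps, inputs, capacity=2):
        print(step)
```

Because the planner sees the whole DAG, it can decide before each stage starts what to bring in and what to drop, instead of reacting to misses after the fact.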

Kariz also revisits some common assumptions. For example, it asserts that hit ratio and performance are not necessarily directly related, input files don’t need to be cached in their entirety, and the limits of back-end storage bandwidth are a relevant metric.

Making memory less of a bottleneck

Other research at the Red Hat Collaboratory at Boston University has focused on kernel techniques that optimize memory bandwidth while keeping latency predictable. 

Recently, memory density has increased vastly, thanks to Non-Volatile Dual In-Line Memory Modules (NVDIMMs). But memory bandwidth has long been one of the most limiting factors on overall system performance; CPUs can execute dozens or even hundreds of instructions in the time it takes to access memory just once. The larger but slower NVDIMMs therefore coexist with faster DRAM DIMMs in a system, where the DRAM often serves as a cache for NVDIMM-backed main memory. The caching scheme and logic used between the DRAM and NVDIMM memory require kernel knowledge to maximize memory bandwidth and minimize memory latency.

As a result, a project between BU and Red Hat is exploring how the Linux kernel can take advantage of the newest CPU hardware features and system memory topologies. NVDIMMs running in memory mode (they can also operate as fast storage) are the near-term future of computers and must be optimized. Currently the Linux kernel has no way of evenly distributing the pages of NVDIMM memory throughout the DRAM cache, which results in difficult-to-predict memory bandwidth and latency. The project will investigate and evaluate a technique known as page coloring. A significant amount of work has been done on Non-Uniform Memory Access (NUMA) placement to reduce the number of remote (higher latency) memory accesses in more traditional CPU interactions with main memory, but NVDIMM optimizations will likely have to take different approaches.
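As a rough illustration of what page coloring means, the sketch below assumes a direct-mapped DRAM cache indexed by physical address, so that two pages with the same "color" compete for the same cache slot. The cache geometry, the 4 KiB page size, and the names `page_color` and `pick_page` are assumptions made for this example, not details of the BU and Red Hat project.

```python
PAGE_SIZE = 4096  # bytes; assumed base page size

def page_color(phys_addr, cache_bytes, page_size=PAGE_SIZE):
    """Return which page-sized slot of a direct-mapped cache this page maps to."""
    num_colors = cache_bytes // page_size
    return (phys_addr // page_size) % num_colors

def pick_page(free_pages, color_use, cache_bytes):
    """Choose the free page whose color is least used so far.

    free_pages: iterable of physical page addresses
    color_use:  dict color -> pages already allocated with that color (updated in place)
    Ties are broken by the lowest address.
    """
    best = min(
        free_pages,
        key=lambda addr: (color_use.get(page_color(addr, cache_bytes), 0), addr),
    )
    color = page_color(best, cache_bytes)
    color_use[color] = color_use.get(color, 0) + 1
    return best

if __name__ == "__main__":
    cache_bytes = 4 * PAGE_SIZE            # tiny 4-color cache for demonstration
    free = [0, 4 * PAGE_SIZE, 8 * PAGE_SIZE, 1 * PAGE_SIZE]
    use = {}
    first = pick_page(free, use, cache_bytes)
    free.remove(first)
    second = pick_page(free, use, cache_bytes)
    print(first, second, use)
```

An allocator that spreads allocations across colors in this way avoids piling many hot pages onto the same cache slot, which is one source of the unpredictable bandwidth and latency described above.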

The most recent result from this area of research is presented in “E-WarP: A system-wide framework for memory bandwidth profiling and management” by Renato Mancuso, Assistant Professor in the Department of Computer Science at BU; Parul Sohal, a PhD candidate at BU; and Red Hat Distinguished Engineer Uli Drepper.

It proposes an Envelope-aWare Predictive model, or E-WarP for short. It’s a methodology and technological framework to: 

  • analyze the memory demand of applications following a profile-driven approach,
  • make realistic predictions of the temporal behavior of workloads deployed on CPUs and accelerators, and
  • perform saturation-aware system consolidation. 

The goal is to provide the foundations for workload-aware analysis of real-time systems. 
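The intuition behind saturation-aware consolidation can be shown with a toy check. This is not the E-WarP model: it simply assumes each workload has been profiled into a per-interval peak bandwidth demand (an "envelope") and that demands add linearly. The function names and the numbers are hypothetical.

```python
def combined_envelope(envelopes):
    """Sum per-interval peak bandwidth demands (GB/s) across workloads.

    Shorter envelopes are treated as zero demand once they end.
    """
    length = max(len(e) for e in envelopes)
    return [
        sum(e[i] if i < len(e) else 0.0 for e in envelopes)
        for i in range(length)
    ]

def saturated_intervals(envelopes, capacity_gbps):
    """Return the interval indices where summed demand exceeds capacity."""
    return [
        i for i, demand in enumerate(combined_envelope(envelopes))
        if demand > capacity_gbps
    ]

if __name__ == "__main__":
    workload_a = [4.0, 9.0, 2.0]   # hypothetical profiled demand per interval
    workload_b = [3.0, 5.0, 6.0]
    print(saturated_intervals([workload_a, workload_b], capacity_gbps=12.0))
```

A consolidation decision would then avoid co-locating workloads whose combined envelope crosses the saturation point, or shift one of them in time.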

The road ahead

Using advanced software techniques to get the most out of hardware, often dynamically, has been the focus of this article. However, as noted in the introduction, there’s an increasingly symbiotic relationship between software and hardware, in that the software can do the best job when the hardware surfaces the needed information.

Processor hardware is also increasingly implementing dynamic optimizations on its own. Power management is the best-known (and most long-standing) example, but we see more far-reaching optimizations coming online in newer processors. 

For example, Intel’s 3rd Gen Intel Xeon Scalable processor (Ice Lake) has several features in this vein. Intel Speed Select Technology (SST) offers a suite of capabilities to allow users to reconfigure the processor dynamically, at runtime, to match the workload by tuning the frequency, core count, and thermals. Intel Resource Director Technology enables monitoring and control of shared resources to deliver better quality of service for applications, virtual machines (VMs), and containers. 

With the evolution of a feature like SST from something that had to be set in the BIOS at boot time to something that can be dynamically changed at runtime, we see just one way in which software and hardware can work together to maximize the efficiency of a running program.

More to come

We’ll continue to probe these developing areas of research and technology in future issues. Unikernels complement the operating system structure work described in this article. We plan to dive deeper into the processor optimizations I’ve touched on here. And we’ll continue to keep tabs on the ongoing research around Field Programmable Gate Arrays (FPGAs), flexible chips that can be programmed again and again with different code paths for different workloads. One ongoing project aims to use machine learning to control a newly customizable version of the GNU Compiler Collection (GCC) to automatically determine optimization pass ordering for FPGA targets specifically, and thereby improve performance compared with existing proprietary C-to-FPGA methods.

Software and hardware architectures haven’t been this interesting for a while.


Acknowledgements

The author would like to thank Red Hat’s Larry Woodman and Uli Drepper, BU Professor of Electrical and Computer Engineering Orran Krieger, and BU PhD students Tommy Unger, Ali Raza, Parul Sohal, and Han Dong for their assistance with this article.
