Red Hat Research Quarterly

Faster hardware through software

Gordon Haff

Gordon Haff is a Technology Advocate at Red Hat, where he works on emerging technology product strategy, writes about tech trends and their business impact, and is a frequent speaker at customer and industry events. His books include How Open Source Ate Software, and his podcast, in which he interviews industry experts, is Innovate @ Open.

Article featured in

Red Hat Research Quarterly

August 2021

Researchers have tested several techniques for using software to get the most out of hardware. Find out about three promising projects that indicate the direction of this quickly changing field.

It used to be simple to make computer workloads run faster. Wait eighteen months or so for more transistors consuming the same amount of power, maybe some tweaks in the fabrication material, some thoughtful balancing of frequency, core counts, interconnects, and cache sizes, and you were good to go. I’m trivializing a lot of difficult engineering work on both the hardware and software side to be sure. But the essential point is that the processor and other components largely got faster in a way that was mostly transparent to software.

With the slowdown in the transistor density gains predicted by Moore’s Law and the end of the constant power density offered by Dennard scaling, this free (or at least relatively cheap) lunch is ending. Software needs to take advantage of any useful clues the hardware provides to help it run more efficiently, and the hardware itself needs to manage its transistor budget more intelligently.

The horizontal stack described by former Intel CEO Andy Grove, in which the processor and operating system were largely isolated from each other, is giving way to a world in which hardware and software are increasingly co-designed. In this article, I take a look at ongoing research that starts to break down the hard abstractions between software and hardware.

Breaking down abstractions

Can adding a single mechanism to Linux® enable huge performance gains? The Symbiote project led by Tommy Unger, a PhD student at Boston University (BU), enables an otherwise entirely standard Linux process to request elevation, that is, to run with full supervisor privileges. Elevation removes the hardware-enforced boundary between application and operating system code, allowing for aggressive co-optimization where desired. Importantly, the optimization methodology can be incremental. Before modification, Symbiotes are indistinguishable from Linux processes and are fully compatible with all the standard Linux interfaces. During optimization, programmers can settle on hybrids that partially utilize the Linux kernel, but also may specialize hot paths like system calls and interrupt paths. In the extreme, a Symbiote can fully take over the system paths and leave Linux behind—in this sense you might think of Linux acting as a bootloader for a library operating system.

Symbiotes completely relax the application-application and application-kernel protection domain. They can run virtualized or bare metal. Unlike unikernels and library operating systems, Symbiote does not prescribe a fundamental change to the familiar virtual address space layout of a process and the Linux operating system. A Symbiote can run alongside other standard Linux processes, and it can fork(). One guiding principle for Symbiote design is keeping changes within the application executable (as opposed to making changes to the Linux kernel) whenever possible. For example, if a programmer wants to specialize the page fault handler, they can memcpy() the Interrupt Descriptor Table (IDT) into the application address space, modify it there to point to a user-defined handler, and swing the Interrupt Descriptor Table Register to point to this new IDT. The project aims to keep the changes required in the Linux kernel itself to hundreds of lines of code.

Symbiotes allow for fast incremental optimization. Consider optimizing the Redis server. Running a profiler (see Figure 1) shows significant time is spent in the write() syscall. Elevating the process to run as a Symbiote allows for replacing syscalls with low-latency (and low processor perturbation) call instructions. Further, removal of the protection boundary allows the application to call directly into kernel internals. It is easy to prototype application-specific specialization by calling progressively deeper into kernel paths. For Redis’ use of the write() syscall, for example, the programmer can call directly into ksys_write(), skipping the syscall handler. Or they could push further into new_sync_write(), skipping the virtual dispatch in the VFS layer. Making aggressive use of application-specific knowledge, they might call directly from the Redis server app into tcp_sendmsg(), skipping resolution of socket and protocol layers. This procedure allows for the rapid construction of highly optimized, one-off executables, and this effort can be partially amortized by building a shared set of optimization tools in a Symbiote library.
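To get a rough feel for the cost that the syscall boundary adds, the sketch below times a real write() syscall against an ordinary in-process function call. It does not use Symbiote and does not call into the kernel paths named above. The function names ns_per_call and compare_write_paths are invented for this illustration, and Python's interpreter overhead inflates both numbers, so treat the result as a qualitative hint rather than a measurement of what elevation would save.

```python
import os
import time


def ns_per_call(fn, iterations=100_000):
    """Return the average wall-clock cost of calling fn(), in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def compare_write_paths(iterations=100_000):
    """Time a write() syscall and a plain in-process call (illustrative only)."""
    payload = b"x"
    sink = bytearray()  # in-process stand-in for "no protection boundary"
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        # os.write() crosses into the kernel on every call.
        syscall_ns = ns_per_call(lambda: os.write(fd, payload), iterations)
    finally:
        os.close(fd)
    # bytearray.extend() stays entirely in user space.
    call_ns = ns_per_call(lambda: sink.extend(payload), iterations)
    return {"write_syscall_ns": syscall_ns, "plain_call_ns": call_ns}


if __name__ == "__main__":
    for name, cost in compare_write_paths().items():
        print(f"{name}: {cost:.0f} ns per call")
```

The gap between the two figures varies by machine, kernel, and mitigation settings, which is one reason a profiler run on the real workload is the better starting point.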

A programmer may choose to have their Symbiote play nicely with the underlying kernel and other processes, but it can also be used to strip away (or modify) Linux policies.

A process could take direct control of the interrupt and syscall paths (vectoring directly off a network interrupt to a handler in the app’s address space); modify network protocols; change scheduling, preemption, and power policies; modify kernel data structures; or allow for direct application control of devices. On the flip side, traditional kernel paths can be pulled into user space and extended with arbitrary code: consider that this allows for compile time optimization across the application-kernel boundary, or the replacement of a heuristic like a high-water mark with a neural network defined in a high-level programming language.

Even a lowly network card needs help

Intelligent performance tuning can even bring benefit to a device as seemingly simple as a Network Interface Card (NIC). That’s because even a “simple” NIC, such as the Intel X520 10 GbE, is complex, with hardware registers that control every aspect of the NIC’s operation, from device initialization to dynamic runtime configuration. The Intel X520 datasheet documents over 5,600 registers—far more than can be tuned using trial and error.

BU PhD student Han Dong’s project aimed to identify, using targeted benchmarks, those application characteristics that would illuminate mappings between hardware register values and their corresponding performance impact. These mappings were then used to frame the NIC configuration as a set of learning problems such that an automated system could recommend hardware settings corresponding to each network application. This, in turn, allows for new dynamic device driver policies that better attune dynamic hardware configuration to application runtime behavior.

Dong specifically studied different operating system structures for packet processing and their implications for the energy management of network-oriented services. He conducted a detailed operating system-centric study of performance-energy trade-offs within this space. Two hardware “knobs” on modern Intel servers control these trade-offs: 1) Dynamic Voltage and Frequency Scaling (DVFS), which throttles processor frequency to save energy, and 2) the interrupt delay register on modern NICs, which controls the network interrupt firing rate. Both of these hardware controls exist in the general context of methods that can throttle network application processing.

To understand how slowing down processing affects these trade-offs, he manually controlled these two knobs under different traffic loads for four network applications across two different operating system structures: a general purpose Linux kernel and a specialized packet processing operating system (akin to a unikernel or an application built using network acceleration libraries such as the Data Plane Development Kit, DPDK).
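The kind of decision such a manual sweep supports can be sketched as follows: given one measurement per combination of the two knobs, pick the lowest-energy setting that still meets a latency target. This is only an illustration of the trade-off, not the study's method. The Measurement fields, the best_setting helper, the latency budget, and every number in SWEEP are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    cpu_ghz: float          # DVFS setting (processor frequency)
    itr_delay_us: int       # NIC interrupt delay setting
    joules: float           # energy used for the run
    p99_latency_us: float   # 99th-percentile request latency


def best_setting(measurements, latency_budget_us):
    """Return the lowest-energy measurement that meets the latency budget."""
    feasible = [m for m in measurements if m.p99_latency_us <= latency_budget_us]
    if not feasible:
        return None  # no knob combination is fast enough
    return min(feasible, key=lambda m: m.joules)


# Hypothetical numbers for illustration only; not results from the study.
SWEEP = [
    Measurement(2.9, 0, 1200.0, 180.0),
    Measurement(2.9, 100, 1000.0, 260.0),
    Measurement(1.8, 100, 760.0, 410.0),
    Measurement(1.2, 200, 640.0, 900.0),
]

if __name__ == "__main__":
    print(best_setting(SWEEP, latency_budget_us=500.0))
```

With a 500-microsecond budget, the sketch selects the 1.8 GHz, 100-microsecond setting from this invented table, because the slowest configuration saves more energy but misses the latency target.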

Among the findings were that pure network polling could result in better performance and lower energy consumption than interrupt-driven processing; the entire workload finished quickly without using sleep state idle modes. The path length of a specialized operating system can also reduce time spent in processing by improving the instructions-per-cycle (IPC) of application code, even with application-intensive workloads. This creates additional headroom for slowing down the processor and network interrupts to further reduce energy use. By slowing down processing more aggressively, the specialized stack can save up to 76 percent of the energy used by a general purpose operating system, with no noticeable impact on overall performance.

Optimizing data analytics clusters

Another performance optimization project comes out of BU, Northeastern University, and the Mass Open Cloud. The Kariz cache prefetching and management project focuses on the performance of data analytics clusters that usually segregate data into large multi-tenant data lakes, which often perform poorly relative to local storage.

Like some of the other research described, Kariz takes advantage of information that the system is already surfacing. In this case, it’s I/O access information that already exists for use in scheduling and coordinating distributed workers. Spark-SQL, Hive, and Pig all collect this information in the form of a dependency Directed Acyclic Graph (DAG) identifying inputs and outputs for each individual computation. Given future access information (e.g., job DAGs), Kariz determines which datasets to make available in cache, whether by prefetching or by retention, which to evict, and when to do so.
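A minimal sketch of how a job DAG can drive caching decisions follows. It is not Kariz's algorithm, only an illustration of the idea that knowing the DAG in advance tells a cache when each dataset is first needed (a prefetch deadline) and when it is last needed (after which it can be evicted). The function names, stage names, and dataset names are all hypothetical, and the code assumes Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter


def cache_plan(stage_deps, stage_inputs):
    """Return (order, first_use, last_use) for the datasets of a job DAG.

    stage_deps:   {stage: set of stages it depends on}
    stage_inputs: {stage: list of dataset names the stage reads}
    """
    # One valid execution order of the stages.
    order = list(TopologicalSorter(stage_deps).static_order())
    first_use, last_use = {}, {}
    for step, stage in enumerate(order):
        for dataset in stage_inputs.get(stage, []):
            first_use.setdefault(dataset, step)  # prefetch before this step
            last_use[dataset] = step             # safe to evict after this step
    return order, first_use, last_use


def prefetch_order(first_use):
    """Datasets sorted by how soon they are needed (ties broken by name)."""
    return sorted(first_use, key=lambda d: (first_use[d], d))


# Hypothetical job: two scans feed a join, and the report re-reads "users".
DEPS = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "report": {"join"},
}
INPUTS = {
    "scan_orders": ["orders"],
    "scan_users": ["users"],
    "report": ["users"],
}

if __name__ == "__main__":
    order, first_use, last_use = cache_plan(DEPS, INPUTS)
    print("stage order:", order)
    print("prefetch order:", prefetch_order(first_use))
    print("evict after step:", last_use)
```

In this example, "orders" can be evicted as soon as its scan finishes, while "users" is worth retaining until the final stage, a distinction that a recency-based policy without the DAG could not make ahead of time.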

Kariz also revisits some common assumptions. For example, it asserts that hit ratio and performance are not necessarily directly related, input files don’t need to be cached in their entirety, and the limits of back-end storage bandwidth are a relevant metric.

Making memory less of a bottleneck

Other research at the Red Hat Collaboratory at Boston University has focused on kernel techniques that optimize memory bandwidth while keeping latency predictable.

Recently, memory density has increased vastly, thanks to Non-Volatile Dual In-Line Memory Modules (NVDIMMs). But memory bandwidth has long been one of the most limiting factors on overall system performance; CPUs can execute dozens or even hundreds of instructions in the time it takes to access memory just one time. The larger but slower NVDIMMs therefore coexist with the faster DRAM DIMMs in a system, where the DRAM often serves as a cache for the NVDIMM main memory. The caching scheme and logic used between the DRAM and NVDIMM memory require kernel knowledge to maximize memory bandwidth and minimize memory latency.

As a result, a project between BU and Red Hat is exploring how the Linux kernel can take advantage of the newest CPU hardware features and system memory topologies. NVDIMMs running in memory mode (they can also operate as fast storage) are the near-term future of computers and must be optimized. Currently the Linux kernel has no way of evenly distributing the pages of NVDIMM memory throughout the DRAM cache, and this results in difficult-to-predict memory bandwidth and latency. A technique known as page coloring will be investigated and evaluated. A significant amount of work has been done for Non-Uniform Memory Access (NUMA) placement to reduce the number of remote (higher latency) memory accesses in more traditional CPU interactions with main memory, but NVDIMM optimizations will likely have to take different approaches.
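For readers unfamiliar with page coloring, here is a small sketch of the general technique as it is usually described for physically indexed caches. It is not the project's design. The cache size, associativity, and page size are assumed values, and the helper names are invented for the example. Pages whose frame numbers share a "color" compete for the same cache sets, so an allocator that cycles through colors spreads pages evenly across the cache.

```python
def num_colors(cache_bytes, ways, page_bytes=4096):
    """Number of page colors: cache size / (associativity * page size)."""
    return cache_bytes // (ways * page_bytes)


def page_color(pfn, colors):
    """Color of a page frame; frames with equal color share cache sets."""
    return pfn % colors


def pick_frame(free_frames, wanted_color, colors):
    """Return the first free frame of the wanted color, or None."""
    for pfn in free_frames:
        if page_color(pfn, colors) == wanted_color:
            return pfn
    return None


def allocate_round_robin(free_frames, count, colors):
    """Allocate up to `count` frames, cycling through colors in turn."""
    free = list(free_frames)
    chosen = []
    for i in range(count):
        pfn = pick_frame(free, i % colors, colors)
        if pfn is None:
            break  # no free frame of the next color; stop early
        free.remove(pfn)
        chosen.append(pfn)
    return chosen


if __name__ == "__main__":
    # Assumed toy cache: 16 KiB, direct-mapped, 4 KiB pages -> 4 colors.
    colors = num_colors(cache_bytes=16 * 1024, ways=1)
    free = [0, 4, 8, 12, 1, 5, 2, 3]
    naive = free[:4]
    colored = allocate_round_robin(free, 4, colors)
    print("naive colors:  ", [page_color(p, colors) for p in naive])
    print("colored colors:", [page_color(p, colors) for p in colored])
```

In the toy example, taking the first four free frames lands every page on color 0, while the color-aware allocator uses each of the four colors once.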

The most recent result from this area of research is presented in “E-WarP: A system-wide framework for memory bandwidth profiling and management” by Renato Mancuso, Assistant Professor in the Department of Computer Science at BU; Parul Sohal, a PhD candidate at BU; and Red Hat Distinguished Engineer Uli Drepper.

It proposes an Envelope-aWare Predictive model, or E-WarP for short. It’s a methodology and technological framework to:

analyze the memory demand of applications following a profile-driven approach,
make realistic predictions of the temporal behavior of workloads deployed on CPUs and accelerators, and
perform saturation-aware system consolidation.

The goal is to provide the foundations for workload-aware analysis of real-time systems.
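As a loose illustration of what saturation-aware consolidation means, the sketch below adds up per-interval bandwidth demand profiles and checks whether the total ever exceeds a saturation threshold. This is a deliberate simplification and should not be read as E-WarP's model. The profile values, the 10 GB/s threshold, and the function names are all hypothetical.

```python
def combined_demand(profiles):
    """Sum per-interval bandwidth demand (GB/s) across workload profiles."""
    length = max(len(p) for p in profiles)
    return [sum(p[t] for p in profiles if t < len(p)) for t in range(length)]


def fits_together(profiles, saturation_gbps):
    """True if the summed demand never exceeds the saturation bandwidth."""
    return max(combined_demand(profiles)) <= saturation_gbps


# Hypothetical profiled demand per time interval, in GB/s.
WORKLOAD_A = [2.0, 6.0, 2.0]
WORKLOAD_B = [5.0, 1.0, 5.0]
WORKLOAD_C = [1.0, 6.0, 1.0]

if __name__ == "__main__":
    print("A with B:", fits_together([WORKLOAD_A, WORKLOAD_B], 10.0))
    print("A with C:", fits_together([WORKLOAD_A, WORKLOAD_C], 10.0))
```

Workloads A and B peak at different times and can share the memory system under the assumed threshold, whereas A and C peak together and would saturate it, even though C has a lower average demand than B.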

The road ahead

Using advanced software techniques to get the most out of hardware, often dynamically, has been the focus of this article. However, as noted in the introduction, there’s an increasingly symbiotic relationship between software and hardware, in that the software can do the best job when the hardware surfaces needed information.

Processor hardware is also increasingly implementing dynamic optimizations on its own. Power management is the best-known (and most long-standing) example, but we see more far-reaching optimizations coming online in newer processors.

For example, Intel’s 3rd Gen Intel Xeon Scalable processor (Ice Lake) has several features in this vein. Intel Speed Select Technology (SST) offers a suite of capabilities to allow users to reconfigure the processor dynamically, at runtime, to match the workload by tuning the frequency, core count, and thermals. Intel Resource Director Technology enables monitoring and control of shared resources to deliver better quality of service for applications, virtual machines (VMs), and containers.

With the evolution of a feature like SST from something that had to be set in the BIOS at boot time to something that can be dynamically changed at runtime, we see just one way in which software and hardware can work together to maximize the efficiency of a running program.

More to come

We’ll continue to probe these developing areas of research and technology in future issues. Unikernels complement the operating system structure work described in this article. We plan to dive deeper into the processor optimizations I’ve touched on here. And we’ll continue to keep tabs on the ongoing research around Field Programmable Gate Arrays (FPGAs), flexible chips that can be programmed again and again with different code paths for different workloads. One ongoing project aims to use machine learning to control a newly customizable version of the GNU Compiler Collection (GCC) to automatically determine optimization pass ordering for FPGA targets specifically, and thereby improve performance as compared to existing proprietary C-to-FPGA methods.

Software and hardware architectures haven’t been this interesting for a while.

Acknowledgements

The author would like to thank Red Hat’s Larry Woodman and Uli Drepper, BU Professor of Electrical and Computer Engineering Orran Krieger, and BU PhD students Tommy Unger, Ali Raza, Parul Sohal, and Han Dong for their assistance with this article.


Figure 1. Redis server
