Demystifying real-time Linux scheduling latency

This is the third of a series of three articles about the formal analysis and verification of the real-time Linux kernel.

Scheduling latency is the principal metric of the real-time variant of Linux, and it is measured using the cyclictest tool. Despite its practical approach and contributions to the current state of the art of real-time Linux, cyclictest has some known limitations. The main constraint arises from the opaque nature of the latency value provided by cyclictest. The tool only provides information about the latency value, without providing insights on its root causes. This fact, along with the absence of a theoretically sound description of the in-kernel behavior, raises some doubts about whether Linux can really support the “real-time” description.

A common approach in real-time systems theory is categorizing a system as a set of independent variables and equations that describe its integrated timing behavior. The first article of this series (see RHRQ 2:3) presented the thread synchronization model, which is composed of a set of formal specifications that define the behavior of the system. This article leverages the specifications of that model to discover a safe bound for the scheduling latency of Linux. It also uses findings from the second article of this series (see RHRQ 2:4), in which an efficient verification tool was developed, by using the same method to capture the values for variables that compose the scheduling latency on a real system and identify the root cause of high latency values.

From informal to formal

The latency experienced by a thread instance is, informally, defined as the maximum time elapsed between the instant in which it becomes ready while having the highest priority among all ready threads and the instant in which it is allowed to execute its own code after the context switch has already been performed.

A common approach in real-time systems theory is categorizing a system as a set of independent variables and equations that describe its integrated timing behavior. The first step in this approach is to define the task model of the system. In this work, the task model is composed of three levels of tasks: the NMI, the IRQs, and the threads. The system has a single NMI, a set IRQ = {IRQ₁, IRQ₂, …} of maskable interrupts, and a set of threads τ = { τ₁, τ₂, … }. The NMI, IRQs, and threads are subject to a scheduling hierarchy in which NMI always has a higher priority than IRQs, and IRQs always have higher priority than threads.

Given a threadτ_i at a given point in time, the set of threads with a higher priority than τ₁ is denoted by HP(τ_i). Similarly, the set of tasks with priority lower than τ_i is denoted by LP(τ_i). From the τ_ithread perspective, all IRQs and the NMI belong to HP(τ_i). Although the schedulers might have threads with the same priority in their queues, only one among them will be selected to have its context loaded and, consequently, start running. Hence, when scheduling, the schedulers elect a single thread as the highest priority one, with all other active threads belonging to LP(τ_i).

The tasks can also influence one another via synchronization primitives. For example, a thread can postpone the execution of an IRQ by temporarily masking interrupts, or it can defer the execution of another thread in HP(τ_i) by temporarily disabling the preemption. Moreover, the scheduling operation itself influences the thread execution timeline because it is not an atomic operation.

The complexity imposed by the different levels of tasks, the synchronization primitives, and the overhead involved in Linux threads’ scheduling makes informal language inadequate for this analysis. A formal thread synchronization model is required instead. The model, which enables the composition of specifications using a deterministic format, removes the ambiguity of natural language while enabling reasoning about the system in a more analytical manner.

In this work, we have translated the specifications of the thread synchronization model into a set of properties. We then leverage these properties in an analysis that derives a theoretically sound bound for scheduling latency.

A theoretically sound bound for scheduling latency

The scheduling latency experienced by an arbitrary thread τ_i in τ is the longest time elapsed between the time A, in which any job of τ_i becomes ready and has the highest priority, and the time F, in which the scheduler returns and allows τ_i to execute its code, in any possible schedule in which τ_i is not preempted by any other thread in the interval [A, F]. We begin determining which types of entities may prolong the latency of τ_i.

In real-time theory, any time a task in LP(τ_i) delays τ_i, τ_i is said to be blocked. When the task delaying τ_i is in HP(τ_i), τ_i is said to be suffering interference. These two forms of delay were analyzed separately, starting with the blocking. Each of these forms of delay was characterized by variables and equations whose definition supports the model. The main specification used in the definition of the latency bound is shown in Figure 1.

Blocking time
Blocking time (referred to as interference-free latency in the original paper) is characterized with the following equation¹:

L^IF ≤ max(D_ST, D_POID) + D_PAIE + D_PSD

Where:

L means the scheduling latency
IF means interference free
D means the delay of
- ST: the sched tail delay, which is the delay from the IRQs being disabled to cause the context switch, and the return from the scheduler
- POID: the longest preemption or IRQ disabled to postpone the scheduler
- PAIE: the longest time in which the preemption and IRQs are transiently enabled in the return of the preemption or IRQ enable, that will cause the scheduler execution to preempt the current thread
- PSD: the longest time in which the preemption is disabled to execute __schedule() function.

It is worth noting that a blocking delay is only caused by other threads, which by definition belong to the sets of lower priority tasks LP(τ_i). The delay caused by IRQs and NMI is all accounted for as interference.

Interference
Because there is no single way to characterize these workloads, defining the interference caused by the NMI and IRQs presents a challenge. So instead of defining a single best way to compute interference from interrupts, the interference from IRQs and NMI was defined as two functions in the theorem that define the latency. The function can then be selected according to the more accurate representation of the system.

The latency bound
The scheduling latency is then defined as the sum of blocking time and interference time, as in the following equation. The L in both sides of the equation is resolved by solving the equation until it converges on both sides.

L = max(D_ST, D_POID) + D_PAIE + D_PSD + I^NMI(L) + I^IRQ(L)

Figure 2 shows these variables in a timeline format, helping to illustrate the composition of the latency from A to F.

Rtsl: a latency measurement tool

As shown in the first article, it is possible to observe the thread synchronization model’s events using Linux’s tracing features. The obstacle is that the simple capture of these events using trace causes a non-negligible overhead in the system, both in CPU and memory bandwidth, representing a challenge for measuring variables in the microseconds scale. However, as demonstrated in the second article, it is possible to process these events in-kernel, reducing overhead. In this article, an efficient verification method was leveraged to develop the real-time latency measurement toolkit named rtsl. The toolkit architecture is presented in Figure 3.

**Figure 3.** Overview of the rtsl toolkit

The latency parser is a kernel module that uses the thread synchronization model’s kernel tracepoints to observe their occurrence from inside the kernel. The latency parser registers a callback function to the kernel tracepoints. The callback functions then pre-process the events, transforming them into one of the variables presented in Figure 2, only exporting them to the trace buffer when necessary. This reduces the overhead enough to enable the usage of the tool for measurements. For example, the record of these values adds only around two microseconds to the cyclictest measurements when compared to the same system running without trace.

As mentioned in the interference section above, there is no single best way to account for the delay caused by interrupts. Instead of proposing a single function to account for the interference, the rtsl toolkit processes the data from an interrupt using different functions, reporting the hypothetical result from each of them. Examples of functions are considering interrupts as periodic or using the sliding window algorithm, as shown in the experiments section below.

The processing of the data is done in user space, using perf as the interface. Perf is used to capture, store, and process the data in user space. The processing phase, named report, produces both graphical and textual output. The textual output shows the value for each of the variables that compose the latency and the hypothetical latency for each given interrupt function. An example of the output is shown in Figure 4.

The textual output serves to identify how much each variable contributes to the scheduling latency. This evidence can be used then to trace the specific variable, in such a way to identify the root cause of bad values. An example of this procedure is shown in the original paper.

**Figure 4.** perf rtsl output: excerpt from the textual output (time in nanoseconds)

Experiments

This section presents some latency measurements, comparing the results found by cyclictest andperf rtsl while running concurrently in the same system. The experiments were executed on two systems: a workstation and a server. The Phoronix test suite benchmark was used as a background workload to exercise different parts of the system. One sample of the results of the experiments is shown in Figure 5. The workload of each experiment is explained in the legend. The colored columns represent the different metrics. The first is cyclictest, the second is the interference-free latency, and the next four are the hypothetical latency based on the given interrupt interference function from rtsl. Consistently, the proposed approach found sound scheduling latency values higher than cyclictest could find in the same time frame. Considering interference curves such as the sliding window, the latency values are still in the microseconds scale, even on non-tuned general purpose hardware. When considering the highly pessimistic sliding window with oWCET (observed worst-case execution time)interference, the latency is bound to the single digit milliseconds, enough to justify Linux on a vast set of safety-critical use cases.

Final Remarks

Usage of real-time Linux in safety-critical environments, such as in the automotive and factory automation field, requires a set of more sophisticated analyses of both the logical and timingbehavior of Linux. In this series of articles, we presented a viable approach for the formal modeling, verification, and analysis of the real-time preemption mode of Linux. The definition of the latency bound was the primary goal of this research. However, the complexity of Linux required the support of a formal language to abstract the complexity of the code and the development of an efficient way to monitor the relevant events. With the bases set, the mathematical reasoning about the kernel behavior was evident, resulting in an analysis accepted by the academic community, who raised this limitation of Linux more than a decade ago.

In addition to the evident results, such as the efficient runtime verification and the mathematical demonstration of the bound of the scheduling latency, this research is motivating the development of new methods for Linux analysis. Examples of ongoing research range from the usage of temporal language for runtime verification to the application of probabilistic methods for the definition of values for the variables that compose the timeline of tasks on Linux. More details about the “Demystifying the real-time Linux scheduling latency” paper are available at bristot.me/demystifying-the-real-time-linux-latency/.

¹The development of the equations are not presented here because of space constraints. Please refer to the original paper for further information: “Demystifying the real-tIme Linux scheduling latency,” presented in the 32nd Euromicro Conference on Real-Time Systems (ECRTS 2020).

SHARE THIS ARTICLE

Feature

BigDataStack delivers with contributions from industry and university partners

Yosef Moatti

Oshrit Feder

Guy Khazma

Gal Lushi

Paula Ta-Shma

Luis Tomás Bolivar

Miki Kenneth

Josh Salomon

Data skipping and network performance improvement technologies prove their value in data-intensive applications.

Feature

Testing critical IoT systems to mitigate network disruptions

Miroslav Bureš

The Internet of Things brings new opportunities and new challenges for mission-critical applications where lives are at stake. Systematic testing can help. The Internet of Things (IoT) has significantly increased the capabilities of mission-critical systems in many domains. Integrated rescue systems, healthcare, defense, energy, and transportation benefit from using the IoT, enabling faster system reactions […]

Feature

Open Education Project tackles GPU scheduling and metrics visibility

Danni Shi

Enhancements to the education project highlight how research work on OPE drives advancements for many kinds of multitenant environments. The Open Education Project (OPE) continues to develop solutions for optimizing GPU resource usage in a multitenant environment. OPE, a project of Red Hat Collaboratory at Boston University, has long been a pioneer in making high-quality, […]

Feature

Changing the world, one lesson at a time

Matej Hrušovský

Why teaching more teachers is essential to computer science education.

Feature

Translation layers for the cloud: speeding storage performance

Peter Desnoyers

A guide to understanding the hidden algorithms that manage the data in our everyday world, from smartphones to cloud apps. We look at which ones perform faster—and why.

Feature

“When one teaches, two learn”: making the most of technical research mentorship

Matej Hrušovský

Lis Strenger

Research mentorships are the basic building block of productive industry-university relationships. We asked four mentors from around the globe to tell us about the challenges, rewards, and strategies of serving as a mentor. Linking a student’s research goals with the experience of a Red Hat software engineer is at the crux of the Red Hat […]

Feature

Building an intelligent multicluster scheduler with network link abilities

Clodagh Walsh

Ryan Jenkins

Simplify scheduling with an intelligent, multicluster-aware scheduler capable of automatically handling dependent Kubernetes resources and ensuring network connectivity between distributed services. Scheduling resources across a multicluster environment is not a trivial task. As part of a recent cloud-to-edge research collaboration, P2CODE, a team of engineers based out of Red Hat’s Waterford office in Ireland took […]

Feature

Making machine learning accessible across disciplines

Marek Grác

Machine learning has been driving research breakthroughs in many fields. Now there is an open source curriculum designed to help non-specialists build the skills they need to use it. Machine learning is an increasingly important competency in a growing number of fields. Biochemists are using it to create models for protein engineering. Economists are using […]

Feature

Faster hardware through software

Gordon Haff

Researchers have tested several techniques for using software to get the most out of hardware. Find out about three promising projects that indicate the direction of this quickly changing field. It used to be simple to make computer workloads run faster. Wait eighteen months or so for more transistors consuming the same amount of power, […]

Red Hat Research Quarterly

Demystifying real-time Linux scheduling latency

Red Hat Research Quarterly

Demystifying real-time Linux scheduling latency

Daniel Bristot de Oliveira

Red Hat Research Quarterly

May 2021

This is the third of a series of three articles about the formal analysis and verification of the real-time Linux kernel.

From informal to formal

A theoretically sound bound for scheduling latency

Figure 1

Rtsl: a latency measurement tool

Experiments

Figure 5

Final Remarks

Yosef Moatti

Oshrit Feder

Guy Khazma

Gal Lushi

Paula Ta-Shma

Luis Tomás Bolivar

Miki Kenneth

Josh Salomon

Miroslav Bureš

Danni Shi

Matej Hrušovský

Peter Desnoyers

Matej Hrušovský

Lis Strenger

Clodagh Walsh

Ryan Jenkins

Marek Grác

Gordon Haff