The recent advances in AI and telecommunications are enabling a new set of complex cyber-physical systems, including those for safety-critical applications. Safety-critical systems are systems whose failure can result in significant damage, including loss of life. This class of systems ranges from medical devices up to advanced driver-assistance systems (ADAS) for vehicles. Some of these advances rely on a sophisticated software stack, requiring full-featured operating systems such as Linux. Among the features enabling Linux in such environments are the real-time Linux kernel features.
This is the first in a series of three articles about the formal analysis and verification of the real-time Linux kernel.
Real-time systems are computing systems whose correct behavior depends not only on logical behavior but also on timing behavior. For example, the detection of an obstacle in an autonomous vehicle should result in a set of actions that need to be taken in time to avoid a collision.
Linux was not designed as a real-time operating system (RTOS) from scratch. Instead, it was adapted to become one. Nowadays, Linux has a set of advanced RTOS features. For example, it can schedule tasks using an advanced deadline-based scheduler (SCHED_DEADLINE) and react to external events within fewer microseconds with the PREEMPT_RT patchset. Such features enabled a vast set of applications, from high-frequency trading systems to low-latency network communication.
However, the path taken by Linux developers in the analysis of the system differs from the method used by real-time researchers. The challenge for real-time researchers is to demonstrate that the coordinated behavior of the system produces results for all tasks before their respective deadlines, in the worst case. A common approach for such a demonstration starts with the analysis of the specification of the system and the formal definitions of the system properties. These properties are then translated into a set of variables that are used on mathematical analysis of the response time of the tasks of a system.
The evaluation of the real-time features of Linux took a more experimental approach: developing tools to simulate real-time workloads, measuring the response time for their requests, and not explicitly evaluating the internals of the OS kernel. The problem with such an approach is that it is not enough to provide any strong evidence that the worst-case scenarios were found. Moreover, the absence of a formal analysis of the timing behavior of Linux, in the terms used in the real-time scheduling theory, is a challenge for the application of Linux on safety-critical systems where such sophisticated analysis is required.
The reason Linux developers use this approach has its roots in the Linux kernel complexity. The amount of effort required to understand all the constraints imposed by the synchronization mechanism on real-time tasks on Linux is not negligible. The understanding of the synchronization primitives and how they affect the timing behavior of a thread is fundamental for the definition of Linux in terms of real-time scheduling. The challenge is then to describe such operations using a level of abstraction that removes the complexity of the in-kernel code, and to do this in a format that facilitates the understanding of Linux dynamics for real-time researchers, without being too far from the way developers observe and improve Linux.
A model is an abstraction of a system. The process of modeling a system involves the definition of a set of measurable variables associated with the given system. The subset of variables acting on the system from outside are considered input variables, while the subset of variables that are possible to measure while varying the input are defined as the set of output variables, as in Figure 1. The modeling phase of a system also comprises the definition of the mathematical relationship between the input and the output.
The use of mathematical notation removes the ambiguous nature of natural language and enables the application of a more sophisticated analysis of the runtime behavior of Linux. The problem is, which mathematical method could be used to model the Linux behavior?
The developers of Linux observe and debug the timing properties of Linux using the tracing features present in the kernel. For example, they interpret a chain of events, trying to identify the states that cause delays in the activation of the highest priority thread, and then try to change kernel algorithms to avoid such delays. The notion of events, traces, and states used by developers is common to Discrete Event Systems (DES), which present automata as a modeling formalism.
The automata formalism is well established as a language for the modeling and verification of systems. Automata are characterized by the directed graph or state transition diagram representation, as in Figure 2, where the arcs represent the events, and circles mean the states. This simple format hides the complexity of the mathematical definition, which allows the use of sophisticated operations and analysis while allowing an intuitive way for reasoning about the property being specified, close to the way that the kernel developers already use while analyzing the traces of the system.
Modeling Linux’s thread behavior
To enable the use of a more sophisticated analysis of the timing behavior of Linux tasks, we proposed an automata-based model for Linux threads (which are the Linux task’s entities). The model aims to remove the code complexity of Linux by presenting a simplified view of the system, but still using the same set of abstractions used by kernel developers, enabling efficient communication between kernel developers and real-time researchers.
The approach used in the PREEMPT_RT Thread model development is presented in Figure 3:
The informal knowledge about the timing behavior of the Linux tasks was modeled as a set of formal specifications using the automata theory. The model was built upon a set of events used by kernel developers to analyze and describe the evolution of Linux threads. These events can be observed and analyzed using the Linux kernel tracing features.
The final version of the model was composed of 34 events, 9,017 states, and 20,103 transitions, demonstrating how complex the timing behavior of Linux threads is. However, the model was not built as a single monolithic automaton. Instead, the model was created using a modular approach, in which the final model is composed of the synchronization of a set of small models. These small models are divided into two classes. The generators represent the independent actions of the system. For example, IRQs can be disabled and enabled (Figure 4), and the scheduler can be called and returned (Figure 5). The specifications describe the coordinated behavior of the generators. For example, the model presented in Figure 6 specifies that the scheduler cannot be called while interrupts are disabled.
The final model is composed of 12 generators and 33 specifications. The vast majority of the generators and specifications are modeled with only two or three states, while the largest has only eight states. This is an essential factor: in the end, the complexity of Linux is indeed composed of a set of small specifications.
The validation of the model was done using both the analysis of the properties of the automata, as well as a comparison of the model against the trace of the execution of the system.
The automata format allowed the analysis of the non-functional properties of the model. For example, the thread model is deterministic, meaning that one event can lead the system to a single conclusion. The model is free of deadlocks and livelocks, and it is possible to reach a safe state (in this model, the initial state) from all states of the system.
One of the main benefits of using the common event abstraction is that it facilitated the automatic validation of the model. During the development of the model, the perf tracing tool was extended to enable the execution of the automaton by using the events generated by a real implementation of the system. Initially, the tool pointed to many cases that were not initially covered by the model. However, at a given point, the tool started to unveil points in the Linux kernel code that were not following the specifications. Such cases were analyzed and reported to the kernel community, which confirmed three bugs in the kernel, evidencing the model’s adequacy.
Usage of the model
The model has found two main applications: the runtime verification of formal specifications and the timing analysis of the Linux kernel.
The discovery of kernel bugs using the automata-based models motivated the usage of formal specifications for the runtime verification of the kernel. Indeed, the specifications presented in this research were later used as the basis for the development of an efficient method for the formal verification of the Linux kernel.
The model was also used as the base for the formal definition of the real-time Linux scheduling latency components, using the same mathematical approach used in the real-time scheduling theory, allowing a new set of timing analyses for the real-time Linux kernel.
Linux is a sophisticated real-time operating system and is enabling the development of a new set of cyber-physical systems, many of them with safety-critical and real-time requirements. Such a class of systems requires the application of sophisticated analysis that evidences the correct behavior of the system, both in the logical and timing perspectives. Automata-based model usage allowed the formal specification of an intricate part of the Linux kernel, enabling the unambiguous understanding of the system behavior from the real-time systems theory perspective and the runtime verification of the adherence of the kernel code to the expected behavior. The details of the analysis enabled by the automata model will be the subject of future articles in RHRQ.
This research was done in a collaboration of Red Hat with Prof. Rômulo Silva de Oliveira (Universidade Federal de Santa Catarina) and Prof. Tommaso Cucinotta (Scuola Superiore Sant’Anna).
More details of this research can be found in this paper: Daniel B. de Oliveira, Rômulo S. de Oliveira, and Tommaso Cucinotta, “A Thread Synchronization Model for the PREEMPT_RT Linux Kernel,” Journal of Systems Architecture 107 (2020). DOI: 10.1016/j.sysarc.2020.101729.