SnappyOS: Fault-Tolerant and Energy-Efficient Framework for HPC Applications

Project summary:

SnappyOS is a framework for high-performance computing (HPC) applications that aims to provide fault tolerance while improving resource utilization and energy efficiency across a cluster. It consists of a set of tools and libraries that let application developers build distributed data processing applications that are robust and reliable and that keep energy consumption low.

Project description:

SnappyOS provides fault tolerance for HPC applications by building on the Checkpoint/Restore in Userspace (CRIU) engine. CRIU is a tool for checkpointing and restoring Linux processes: it saves the runtime state of a running application, including a containerized one, to persistent storage and restores it later on the same or a different physical or virtual machine. By using CRIU, SnappyOS aims to provide the highly available computing environment that many scientific applications depend on. These applications can run for days or weeks, performing data processing tasks on large-scale distributed HPC systems. When a node fails, SnappyOS can use CRIU to restore the affected processes from the most recent checkpoint so that the application resumes from a consistent state. The application then continues from that checkpoint instead of restarting from the beginning, which limits the lost work to whatever was computed after the checkpoint was taken.

In addition, by using CRIU to migrate applications from one node to another, SnappyOS can reduce energy consumption by dynamically relocating applications to improve data locality and to balance workloads across cluster nodes. In this way, CRIU plays a central role in enabling SnappyOS to provide both fault tolerance and energy efficiency for large-scale scientific applications.
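
The checkpoint and restore steps described above can be sketched with the standard `criu` command-line tool. This is a minimal illustration, not part of SnappyOS itself: the function names and the image directory are hypothetical, the process is assumed to be checkpointable by CRIU, and `criu` usually has to run with root privileges. For migration, the image directory would additionally have to be reachable from the target node (for example on shared storage).

```python
import os
import subprocess
from typing import List


def criu_dump_cmd(pid: int, images_dir: str,
                  leave_running: bool = True,
                  shell_job: bool = False) -> List[str]:
    """Build the `criu dump` command for the process tree rooted at `pid`."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the application running after the checkpoint is written.
        cmd.append("--leave-running")
    if shell_job:
        # Needed when the process was started from an interactive shell.
        cmd.append("--shell-job")
    return cmd


def criu_restore_cmd(images_dir: str, shell_job: bool = False) -> List[str]:
    """Build the `criu restore` command for images stored in `images_dir`."""
    cmd = ["criu", "restore", "--images-dir", images_dir, "--restore-detached"]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


def checkpoint(pid: int, images_dir: str) -> None:
    """Write a checkpoint of `pid` into `images_dir` (hypothetical helper)."""
    os.makedirs(images_dir, exist_ok=True)
    subprocess.run(criu_dump_cmd(pid, images_dir), check=True)


def restore(images_dir: str) -> None:
    """Restore the process tree saved in `images_dir` (hypothetical helper)."""
    subprocess.run(criu_restore_cmd(images_dir), check=True)
```

A periodic caller could invoke `checkpoint(pid, "/shared/ckpt/job-42")` at a fixed interval and call `restore("/shared/ckpt/job-42")` on a healthy node after a failure; the path shown is only an example. Coordinating checkpoints across the ranks of a distributed job is outside the scope of this sketch.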

Other involved Red Hatters: