Europe RIG Meeting [April 2023]
Checkpointing and Rollback-Recovery of Distributed Applications in Kubernetes
Research talk by Radostin Stoyanov, University of Oxford
In recent years, containers have gained widespread adoption for developing cloud-native applications and microservices. These applications are typically composed of multiple containers managed by a container orchestration platform. However, the container replication techniques that these platforms currently offer are insufficient to provide high availability and strong consistency for distributed stateful applications. In this talk, we will explore how container checkpointing can be used to provide fault tolerance for distributed applications, and discuss some of its benefits and challenges. The talk will also provide insights into how coordinated checkpointing can enhance the reliability and resilience of stateful applications in Kubernetes clusters.
Radostin Stoyanov is a PhD student at the University of Oxford. His research focuses on improving the resilience and performance of HPC and cloud computing systems. Before joining Oxford, Radostin received his MPhil degree in Advanced Computer Science from the University of Cambridge and his MEng degree in Computing Science from the University of Aberdeen. His master’s research explored virtualization in programmable network devices and secure image-less container migration.
Related research project: SnappyOS: Fault-Tolerant and Energy-Efficient Framework for HPC Applications