SpotOS – a distributed cloud-based operating system over unreliable resources
The aim of this project is to devise and implement a distributed cloud-based operating system that uses unreliable or temporarily available resources to provide a reliable and scalable execution experience with high quality-of-service. The proposed system will achieve this by harnessing resources that represent currently unused cloud capacity. These resources are known in AWS as spot instances.
Spot instances present a considerably cheaper (up to a 90% discount) alternative to renting ‘regular’ instances. However, the low price tag comes with a caveat: a workload running on a spot instance can be unexpectedly interrupted at any time when the instance is reclaimed by the cloud provider. In this case, a very limited time window is given to the running application to back up its current state. As a result, spot instances are mostly used by stateless applications, applications with very small state size, or those with a dedicated fault-tolerance mechanism.
With SpotOS, we intend to overcome the above limitations by providing a reliable, adaptive, “smart lake of resources” abstraction with the spot instances serving as the underlying unreliable building blocks. SpotOS will handle the orchestration of user workloads over this unreliable cloud resource layer while guaranteeing stable, uninterrupted execution combined with low cost of deployment.
One of the main obstacles to realizing this vision is the challenge of promptly migrating a complex stateful application when its spot instance is reclaimed. Because this event comes on short notice, copying the entire state to a “safe haven” may not be possible. We aim to solve this by introducing EDM, an external distributed memory mechanism. With EDM, the application state is split among multiple storage units (which could be hosted on regular instances, spot instances, or a mix of both) in chunks small enough for the evacuation to be completed on time.
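The sizing argument behind this splitting can be illustrated with a little arithmetic. The following is only a minimal sketch, not the EDM design: it assumes that chunks are evacuated in parallel, that each storage unit is limited by a fixed transfer bandwidth, and that a safety margin is kept within the notice window. The function names, parameters, and the safety factor are all hypothetical.

```python
import math


def min_storage_units(state_bytes: int, window_s: float,
                      unit_bandwidth_bps: float, safety: float = 0.8) -> int:
    """Smallest number of storage units whose parallel evacuation fits the window.

    Assumption (not from the project description): every unit moves its chunk
    at `unit_bandwidth_bps` bytes per second, and only a `safety` fraction of
    the notice window is treated as usable.
    """
    if state_bytes <= 0:
        return 0
    # Bytes a single unit can evacuate within the usable part of the window.
    per_unit_capacity = window_s * safety * unit_bandwidth_bps
    return math.ceil(state_bytes / per_unit_capacity)


def split_state(state: bytes, n_units: int) -> list[bytes]:
    """Split the state into at most `n_units` nearly equal contiguous chunks."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    if not state:
        return []
    chunk_size = math.ceil(len(state) / n_units)
    return [state[i:i + chunk_size] for i in range(0, len(state), chunk_size)]


if __name__ == "__main__":
    state = bytes(range(256)) * 4          # 1024 bytes of toy application state
    units = min_storage_units(len(state), window_s=10,
                              unit_bandwidth_bps=10, safety=1.0)
    chunks = split_state(state, units)
    print(units, len(chunks), max(len(c) for c in chunks))
```

In this toy setting each unit can move 100 bytes within the window, so the 1024-byte state needs 11 units and no chunk exceeds 100 bytes. A real mechanism would also have to account for contention, unit failures, and state that changes during evacuation.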
To ensure optimal performance in the presence of diverse applications and multiple heterogeneous resources, SpotOS will have to continuously calculate the best possible application-resource assignments. The resulting runtime configuration must guarantee high quality-of-service while minimizing both the number of application migrations and the risk of execution preemption. A dedicated component that we call EC (economic calculator) will be in charge of solving this exceedingly complex optimization problem. Additionally, we will use recent advances in predictive analytics and time series forecasting to estimate future resource consumption from learned application behavior, thereby improving the quality of the optimization.
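To make the shape of this assignment problem concrete, here is a toy sketch that exhaustively searches application-to-resource assignments. It is not the EC algorithm: the cost model (price of each resource in use plus preemption risk times a per-application migration penalty), the capacity constraint, and all class and field names are assumptions chosen for illustration. Brute force is only viable for a handful of applications.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Resource:
    name: str
    capacity: float      # hypothetical: e.g. number of vCPUs
    price: float         # hypothetical: cost per hour while in use
    preempt_risk: float  # hypothetical: reclaim probability over the horizon


@dataclass(frozen=True)
class App:
    name: str
    demand: float             # capacity the application needs
    migration_penalty: float  # assumed cost of one forced migration


def best_assignment(apps, resources):
    """Return (mapping, cost) of the cheapest feasible assignment, or (None, inf)."""
    best, best_cost = None, float("inf")
    # Try every way of placing each application on one resource.
    for choice in product(range(len(resources)), repeat=len(apps)):
        load = [0.0] * len(resources)
        for app, r in zip(apps, choice):
            load[r] += app.demand
        # Skip assignments that overload any resource.
        if any(load[i] > resources[i].capacity for i in range(len(resources))):
            continue
        # Pay for each resource in use, plus the expected migration cost.
        cost = sum(resources[i].price for i in set(choice))
        cost += sum(resources[r].preempt_risk * app.migration_penalty
                    for app, r in zip(apps, choice))
        if cost < best_cost:
            best, best_cost = choice, cost
    if best is None:
        return None, float("inf")
    return {app.name: resources[r].name for app, r in zip(apps, best)}, best_cost


if __name__ == "__main__":
    resources = [Resource("spot", 4, 1.0, 0.5), Resource("on_demand", 4, 3.0, 0.0)]
    apps = [App("web", 2, 1.0), App("db", 3, 10.0)]
    print(best_assignment(apps, resources))
```

With these made-up numbers, the application that is cheap to migrate lands on the spot instance and the one that is expensive to migrate lands on the regular instance. A practical solver would replace the exhaustive search with a scalable optimization method and feed it forecast demand rather than fixed values.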
The figure at the top of the page illustrates the proposed structure of SpotOS.