Ceph Drive Failure Prediction

More than a million terabytes of data are generated every day from sources such as emails, social media platforms, and IoT devices. Much of this data is saved to persistent storage. Since every bit of data is valuable, modern storage solutions need to be reliable, scalable, and efficient. Many storage systems, including Ceph, therefore use replicas or erasure-coded redundancy to provide fault tolerance. This redundancy consumes extra capacity, so while scaling storage up to the exabyte level is possible, it can be resource-intensive. This cost can be mitigated using machine learning.

The primary goal of this project is to build a model that predicts whether a hard drive will fail within a predefined future time interval. These predictions could then be used to create or destroy replicas as needed, making storage more resource-efficient. The models in this project are trained on the Backblaze SMART metrics dataset, which is the only publicly available dataset of SMART metrics (as of May 2019). A secondary goal is to frame this problem in a Kaggle competition format, giving the community a platform to contribute ideas.
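
One way to make the prediction target concrete is to label each daily drive record by whether that drive fails within a fixed horizon. The sketch below is illustrative only and is not the project's actual pipeline. It assumes each record carries `date`, `serial_number`, and `failure` fields (modeled on the Backblaze daily snapshots), and the function name `label_drive_days` and the `horizon_days` parameter are hypothetical.

```python
from datetime import date


def label_drive_days(records, horizon_days):
    """Label each drive-day 1 if the drive fails within horizon_days, else 0.

    Assumption: each record is a dict with "date" (datetime.date),
    "serial_number" (str), and "failure" (1 on the day the drive fails, else 0).
    """
    # First pass: record the day each drive was reported as failed, if any.
    failure_day = {}
    for rec in records:
        if rec["failure"] == 1:
            failure_day[rec["serial_number"]] = rec["date"]

    # Second pass: a drive-day is positive when the failure falls inside
    # the window [day, day + horizon_days].
    labeled = []
    for rec in records:
        fail = failure_day.get(rec["serial_number"])
        if fail is None:
            label = 0
        else:
            days_left = (fail - rec["date"]).days
            label = 1 if 0 <= days_left <= horizon_days else 0
        labeled.append((rec["serial_number"], rec["date"], label))
    return labeled


# Hypothetical sample data: drive "A" fails on the third day, drive "B" never fails.
records = [
    {"date": date(2019, 5, 1), "serial_number": "A", "failure": 0},
    {"date": date(2019, 5, 2), "serial_number": "A", "failure": 0},
    {"date": date(2019, 5, 3), "serial_number": "A", "failure": 1},
    {"date": date(2019, 5, 1), "serial_number": "B", "failure": 0},
]
labels = label_drive_days(records, horizon_days=1)
```

With a one-day horizon, only the last two records of drive "A" are labeled positive. A longer horizon gives more lead time for adjusting replicas but also marks more healthy-looking days as positive, so the choice of interval is a trade-off rather than a fixed value.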

Red Hat Intern: Karan Chauhan