AI for Cloud Ops
AI for Cloud Ops is a project of the Red Hat Collaboratory at Boston University.
Today’s Continuous Integration/Continuous Deployment (CI/CD) trends encourage rapid design of software using a wide range of customized, off-the-shelf, and legacy software components, followed by frequent updates that are immediately deployed on the cloud. Together, this component diversity and breakneck pace of development make it harder to identify, localize, or fix problems related to performance, resilience, and security. Existing approaches that rely on human experts have limited applicability to modern CI/CD processes, as they are fragile, costly, and often not scalable.
This project aims to address this gap in effective cloud management and operations with a concerted, systematic approach to building and integrating AI-driven software analytics into production systems. We aim to provide a rich selection of heavily automated “ops” functionality as well as intuitive, easily accessible analytics to users, developers, and administrators. In this way, our longer-term aim is to improve performance, resilience, and security in the cloud without incurring high operational costs.
Graphic caption: An illustrative overview of the “AI for Cloud Ops” project, which aims to demonstrate the performance, resilience, and security benefits of AI-driven cloud analytics in modern continuous integration/continuous deployment environments. The project will make customized analytics available to developers and administrators via queryable APIs during open-source software deployment (e.g., through Jupyter notebooks) and at runtime.
Other Funding that Supports this Research
- Ayse Coskun, IBM Faculty Award, 2020
- Ayse Coskun (Co-PI), NSF CISE CSR, A Just-in-Time, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications. PI: Raja Sambasivan at Tufts University, 2018-2022
- Ayse Coskun, IBM Open Collaborative Research Award, 2016-2020
- Ayse Coskun, Red Hat Collaboratory, 2018-2020
Project Resources and Repositories
- Operate First/AI for Cloud Ops GitHub
- Iter8 Online Experimentation Framework
- Praxi: Software Discovery using ML
- ACE: Approximate Concrete Execution
- OSD Alert Analysis GitHub
  Description: OSD Alert Analysis is a web-based tool for analyzing operational alerts produced by OpenShift Dedicated clusters. Users can filter alerts by namespace, severity, resolution time, and frequency to see alerting trends over time and to identify candidates for alert threshold tuning.
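The kind of filtering described for OSD Alert Analysis can be illustrated with a minimal Python sketch. This is not the tool's actual code or data model: the `Alert` record, the `filter_alerts` and `tuning_candidates` helpers, the sample alert names, and the heuristic that an alert which fires often and always resolves quickly may be a candidate for threshold tuning are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are assumptions, not the tool's schema."""
    name: str
    namespace: str
    severity: str
    resolution_minutes: float


def filter_alerts(
    alerts: Iterable[Alert],
    namespace: Optional[str] = None,
    severity: Optional[str] = None,
    max_resolution_minutes: Optional[float] = None,
) -> List[Alert]:
    """Keep alerts matching every filter that is set (None means no filter)."""
    selected = []
    for alert in alerts:
        if namespace is not None and alert.namespace != namespace:
            continue
        if severity is not None and alert.severity != severity:
            continue
        if (max_resolution_minutes is not None
                and alert.resolution_minutes > max_resolution_minutes):
            continue
        selected.append(alert)
    return selected


def tuning_candidates(
    alerts: Iterable[Alert],
    min_count: int = 3,
    max_resolution_minutes: float = 5.0,
) -> List[str]:
    """Assumed heuristic: names that fire at least min_count times and
    resolve within max_resolution_minutes on every occurrence."""
    alerts = list(alerts)
    total = Counter(a.name for a in alerts)
    quick = Counter(
        a.name
        for a in filter_alerts(alerts, max_resolution_minutes=max_resolution_minutes)
    )
    return sorted(
        name for name, count in total.items()
        if count >= min_count and quick[name] == count
    )


# Made-up sample data for illustration only.
SAMPLE_ALERTS = [
    Alert("HighPodRestarts", "openshift-monitoring", "warning", 2.0),
    Alert("HighPodRestarts", "openshift-monitoring", "warning", 1.5),
    Alert("HighPodRestarts", "openshift-monitoring", "warning", 3.0),
    Alert("NodeDiskFull", "openshift-storage", "critical", 45.0),
]

if __name__ == "__main__":
    print(filter_alerts(SAMPLE_ALERTS, severity="critical"))
    print(tuning_candidates(SAMPLE_ALERTS))
```

In this sketch each filter is optional, so the same function supports filtering by namespace alone, by severity alone, or by any combination. The frequency view is a simple count per alert name.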
Principal Investigator: Ayse Coskun
Co-PIs: Alan Liu and Gianluca Stringhini
Red Hat Collaborators: Marcel Hild, Steven Huels, and Daniel Riek
IBM Collaborator: Fabio Oliveira
Graduate Students: Anthony Byrne, Mert Toslali, Saad Ullah, and Lesley Zhou
Read: Machine learning for operations: Can AI push analytics to the speed of software deployment?, Red Hat Research Quarterly, May 2022
RHRQ asked Professor Ayse Coskun of the Electrical and Computer Engineering Department at Boston University to sit down for an interview with Red Hatter Marcel Hild. Professor Coskun is one of the Principal Investigators on the project AI for Cloud Ops, which recently won a $1 million Red Hat Collaboratory Research Incubation Award. Their conversation delves into the need for operations-focused research on real-world systems and the capacity of more mature AI technology to solve problems on a large scale. (Read full article)
Watch: Research Days AI for Cloud Ops Talk, February 16, 2022 (event page with abstract)