Red Hat Research Quarterly

How to open source cloud operations

about the author

Marcel Hild

Marcel Hild has 25+ years of experience in open source business and development. He co-founded a Linux consulting company and has worked as a freelance developer, a Solution Architect for Red Hat, and a core developer for Cloudforms, a hybrid cloud management tool. He now researches AIOps in the Office of the CTO at Red Hat, proving how AI will help operate machines and applications.

Article featured in

Red Hat Research Quarterly

August 2020

Open source has become a dominant paradigm for developing software. One major factor for its success is its transparency: if you have a problem with the software, you can peek into the details of the code, search the issue tracker, ask for help, and maybe even provide a fix. This means that even though most users don’t write code, the mere fact that everything is open will help the majority of users. Now it’s time to apply the open source model to the cloud.

Today, in the age of cloud computing, we consume services that we expect to just work, and our applications are a complex mesh of those services. Developers need to configure software on demand with elasticity, resilience, security, and self-service in mind. That means the implementation and operation of those services, i.e., the cloud, have become correspondingly more complicated.

If open source made software great, how do we open source an implementation or the operation of something? By definition it’s always different; there is no single binary that gets deployed multiple times. Instead it’s an implementation of a procedure, a process. Same with operations: it’s all the live data of metrics, logs, and tickets, and how software and the operations team react to it. So all implementations of a cloud, be it the large-scale proprietary public service or the on-premise private cloud, are snowflakes. Yes, best practices exist and there are excellent books. But still, you can’t `git clone cloud` or `rpm -i cloud`.

Extending access to operations

So we need to open up what it takes to stand up and operate a production-grade cloud. This must include not only architecture documents and installation and configuration files, but all the data produced along the way: metrics, logs, and tickets. You’ve probably heard the AI mantra that “data is the new gold” many times, and there is some deep truth in it. Software is no longer the differentiating factor: it’s the data. Dashboards, post-mortems, chat logs: everything. Basically, we need public, read-only access.

Signing up for read-write should be easy. Lowering the barrier of access was key to the success of open source, so let’s lower the barrier to peek into the back office of the cloud as well. It opens up a slew of new opportunities. Suddenly we can create a real operations community. Current operations communities either center around a particular piece of technology, like the Prometheus monitoring community, or a certain approach to operations, like the Site Reliability Engineering (SRE) methodology. These are great, but we can also bring it down from the meta-level to the real world, where you can touch things. If you can’t log into it, it does not exist.

We can also extend the community to people who operate their own clouds. Those DevOps practitioners can watch and learn how a cloud is operated, then contribute by sharing their opinions on architectural decisions or their internal practices, and maybe even engage in operating bits of the open cloud. It’s the same progression as in open source projects.

Shifting to Operate First

There’s a principle in development called Shift Left, which means that we should involve testing really early in the development cycle—in other words, moving left in the process. This is already done with unit and integration tests. No line of code gets merged if it does not pass the tests. But what about operations?

At Red Hat we coined the term Operate First for this. The idea is similar to Upstream First, where we strive to get every line of code into an upstream project before we ship it in a product. In Operate First, we want the group that develops the software to also run it in an operational context. And since we develop mainly in open source communities, this extends our open cloud to another group of people, the engineering community. The very authors of the code can be asked in an incident ticket about a misbehaving piece of the cloud. This not only increases the probability of getting the incident closed quickly, but it also exposes software developers to the operational context of their brainchild. Maybe they come back later, simply watch how their software is being used, and make future design decisions based on the operations. The next level would be to try out new features in bleeding-edge alpha versions of a particular service and get a real workload instead of fake test data.

Bringing in AIOps

Speaking of data, that brings us to the next audience of an open cloud: the research and AI community. AIOps is another term that is used frequently, and to be honest it is as nebulous as the term cloud was a decade ago. To me, it means augmenting IT operations with the tools of AI, which can happen on all levels, starting with data exploration. If a DevOps person uses a Jupyter notebook to cluster some metrics, I would call it an AIOps technique. And since the data is available in the open cloud, it should be pretty easy.
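To make the notebook idea concrete, here is a minimal sketch of clustering metrics with a plain k-means written against the Python standard library. The sample values, the two metric dimensions, and the `kmeans` helper are all assumptions made for illustration; a real notebook would more likely pull metrics from a monitoring system and use an established library.

```python
import random
from statistics import mean


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means over tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return centroids, labels


# Hypothetical (cpu_utilization, memory_utilization) samples, one per node.
samples = [
    (0.10, 0.20), (0.12, 0.25), (0.15, 0.22),  # mostly idle nodes
    (0.85, 0.90), (0.88, 0.95), (0.90, 0.85),  # heavily loaded nodes
]

centroids, labels = kmeans(samples, k=2)
for sample, label in zip(samples, labels):
    print(sample, "-> cluster", label)
```

On this toy data the idle and loaded nodes end up in separate clusters. Which cluster gets which number depends on the random starting points, so only the grouping is meaningful.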

But the road to the self-driving cluster is paved with a lot of data, and specifically labeled data. You will find large data sets of images labeled as cats, but try to find data sets of clusters labeled with incidents. Creating such data sets and publishing them under an open license will spark the interest of AI researchers, because being data driven lets us be more precise about a problem. We can try to predict an outage before it happens.
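As a rough illustration of what labeling cluster data with incidents could look like, the sketch below joins metric samples with incident windows taken from tickets. Everything in it is hypothetical: the timestamps, the `error_rate` metric, the 15-minute lead time, and the `label_samples` helper are assumptions, not a description of an existing data set or tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident tickets: (start, end) of each outage window.
incidents = [
    (datetime(2020, 8, 1, 12, 0), datetime(2020, 8, 1, 12, 30)),
]

# Hypothetical metric samples: (timestamp, error_rate).
metrics = [
    (datetime(2020, 8, 1, 11, 30), 0.01),
    (datetime(2020, 8, 1, 11, 50), 0.07),
    (datetime(2020, 8, 1, 12, 10), 0.40),
    (datetime(2020, 8, 1, 13, 0), 0.01),
]


def label_samples(metrics, incidents, lead_time=timedelta(minutes=15)):
    """Label each sample as 'incident', 'pre-incident', or 'normal'."""
    labeled = []
    for timestamp, value in metrics:
        label = "normal"
        for start, end in incidents:
            if start <= timestamp <= end:
                # The sample falls inside an outage window.
                label = "incident"
                break
            if start - lead_time <= timestamp < start:
                # The sample falls shortly before an outage begins.
                label = "pre-incident"
        labeled.append((timestamp, value, label))
    return labeled


labeled = label_samples(metrics, incidents)
for timestamp, value, label in labeled:
    print(timestamp.isoformat(), value, label)
```

The "pre-incident" label is the part that matters for prediction: a model trained to recognize those samples is, in effect, learning to flag an outage before it happens.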

Once the model is trained and tested against the test data, with the open cloud we can go even one step further. Researchers can collaborate with the operations team to validate their models against a live target. Operations can then adopt the model to enhance their operational excellence and finally involve software engineering. Ultimately, you want the model and the intelligence captured in code, right in the software that is being deployed—the software that will be deployed in another datacenter, in another incarnation of a cloud. That way, it will improve the operational excellence of all the clouds. This brings us closer to a world where operations of a cloud can be shared and can be installed, since it’s embedded in the software itself. To get there, we need that feedback cycle and an open source community that involves all three parties—operations, engineering, and research—and we need a living environment to iterate upon.

Sounds like a story from the future? The process has already begun. Red Hat is working with an evolving open cloud community at the Massachusetts Open Cloud to help define an architecture of an open cloud environment where operability is paramount and data-driven tools can play a key role. All discussions happen in public meetings and, even better, are tracked in a Git repository, so we can involve all parties early in the process and trace back how we came to a certain decision. That’s key, since the decision process is as important as the final outcome. All operational data will be accessible, and it will be easy to run a workload there and to get access to backend data. 

If you’re interested in collaborating, join us at
