Red Hat Research Quarterly

How to open source cloud operations

Marcel Hild

Marcel Hild has 25+ years of experience in open source business and development. He co-founded a Linux consulting company and worked as a freelance developer, a Solution Architect for Red Hat, and a core developer for CloudForms, a hybrid cloud management tool. He now researches AIOps in the Office of the CTO at Red Hat, proving how AI will help operate machines and applications.

Article featured in

Red Hat Research Quarterly

August 2020



Open source has become a dominant paradigm for developing software. One major factor for its success is its transparency: if you have a problem with the software, you can peek into the details of the code, search the issue tracker, ask for help, and maybe even provide a fix. This means that even though most users don’t write code, the mere fact that everything is open will help the majority of users. Now it’s time to apply the open source model to the cloud.

Today, in the age of cloud computing, we consume provided services that we expect to just work, and our applications are a complex mesh of those services. Developers need to configure software on demand with elasticity, resilience, security, and self-service in mind. That means the implementation and operation of those services, i.e., the cloud, have become correspondingly more complicated.

If open source made software great, how do we open source an implementation or the operation of something? By definition, every implementation is different; there is no single binary that gets deployed multiple times. Instead it’s an implementation of a procedure, a process. The same goes for operations: it’s all the live data of metrics, logs, and tickets, and how software and the operations team react to it. So all implementations of a cloud, be it the large-scale proprietary public service or the on-premises private cloud, are snowflakes. Yes, best practices exist and there are excellent books. But still, you can’t `git clone cloud` or `rpm -i cloud`.

Extending access to operations

So we need to open up what it takes to stand up and operate a production-grade cloud. This must include not only architecture documents and installation and configuration files, but also all the data produced along the way: metrics, logs, and tickets. You’ve probably heard the AI mantra that “data is the new gold” multiple times, and there is some deep truth to it. Software is no longer the differentiating factor: it’s the data. Dashboards, post-mortems, chat logs: everything. Basically, we need public, read-only access.

Signing up for read-write access should be easy. Lowering the barrier of access was key to the success of open source, so let’s lower the barrier to peek into the back office of the cloud as well. It opens up a slew of new opportunities. Suddenly we can create a real operations community. Current operations communities center either around a particular piece of technology, like the Prometheus monitoring community, or around a certain approach to operations, like the Site Reliability Engineering (SRE) methodology. These are great, but we can also bring it down from the meta-level to the real world, where you can touch things. If you can’t log into it, it does not exist.

We can also extend the community to people who operate their own clouds. Those DevOps people can watch and learn how a cloud is operated, then contribute by sharing their opinions on architectural decisions or their internal practices, and maybe even engage in operating bits of the open cloud. It’s the same progression as in open source projects.

Shifting to Operate First

There’s a principle in development called Shift Left, which means that we should involve testing really early in the development cycle—in other words, moving left in the process. This is already done with unit and integration tests. No line of code gets merged if it does not pass the tests. But what about operations?

At Red Hat we coined the term Operate First for this. The idea is similar to Upstream First, where we strive to get every line of code into an upstream project before we ship it in a product. In Operate First, we want the software to be run in an operational context by the group that develops it. And since we develop mainly in open source communities, this extends our open cloud to another group of people, the engineering community. The very authors of the code can be asked in an incident ticket about a misbehaving piece of the cloud. This not only increases the probability of getting the incident closed quickly, but it also exposes software developers to the operational context of their brainchild. Maybe they come back later, just watch how their software is being used, and make future design decisions based on those operations. The next level would be to try out new features in bleeding-edge alpha versions of a particular service and get a real workload instead of fake test data.

Bringing in AIOps

Speaking of data, that brings us to the next audience of an open cloud: the research and AI community. AIOps is another term that is being used frequently, and to be honest it is as nebulous as the term cloud was a decade ago. To me, it means augmenting IT operations with the tools of AI, which can happen on all levels, starting with data exploration. If a DevOps person uses a Jupyter notebook to cluster some metrics, I would call it an AIOps technique. And since the data is available in the open cloud, it should be pretty easy.
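
To make the notebook idea concrete, here is a minimal sketch of clustering a handful of metric samples with plain k-means. The metric names and values are made up for illustration, and the `kmeans` helper is a hypothetical stand-in written with the standard library only; it is not taken from any open cloud tooling.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups with plain k-means."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random.
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical samples: (cpu_utilization, p99_latency_seconds) per node.
samples = [
    (0.20, 0.11), (0.25, 0.12), (0.22, 0.10), (0.18, 0.09),
    (0.91, 0.85), (0.95, 0.90), (0.88, 0.80),
]
labels, centers = kmeans(samples, k=2)
for sample, label in zip(samples, labels):
    print(sample, "-> cluster", label)
```

In a real notebook you would more likely pull the samples from the monitoring system and reach for an existing library implementation such as scikit-learn's `KMeans`; the point here is only how little it takes to start exploring once the data is accessible.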

But the road to the self-driving cluster is paved with a lot of data—labeled data. You will find large data sets with images that are labeled as a cat, but try to find data sets of clusters that are labeled with incidents. Creating such data sets and publishing them under an open license will spark the interest of AI researchers, because suddenly we can be more precise about a problem when we can be data driven. We can try to predict an outage before it happens.
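
As a rough sketch of what building such a labeled data set could look like, the snippet below joins metric samples with incident windows taken from tickets. Everything in it is an assumption for illustration: the `label_samples` helper, the three label names, the ten-minute lead time, and the sample values are hypothetical and not part of any existing data set.

```python
from datetime import datetime, timedelta


def label_samples(samples, incidents, lead_time=timedelta(minutes=10)):
    """Attach a label to each (timestamp, value) metric sample.

    "incident"     -> the sample falls inside an incident window
    "pre_incident" -> the sample falls within lead_time before an incident starts
    "normal"       -> everything else
    """
    labeled = []
    for ts, value in samples:
        label = "normal"
        for start, end in incidents:
            if start <= ts <= end:
                # An ongoing incident always wins over a pre-incident label.
                label = "incident"
                break
            if start - lead_time <= ts < start:
                label = "pre_incident"
        labeled.append((ts, value, label))
    return labeled


# Hypothetical data: one sample every five minutes, one incident from a ticket.
t0 = datetime(2020, 8, 1, 12, 0)
samples = [(t0 + timedelta(minutes=5 * i), 0.2 + 0.1 * i) for i in range(6)]
incidents = [(t0 + timedelta(minutes=20), t0 + timedelta(minutes=30))]

dataset = label_samples(samples, incidents)
for ts, value, label in dataset:
    print(ts.isoformat(), round(value, 2), label)
```

The `pre_incident` label is the interesting one for prediction: a model trained to recognize it is, in effect, learning what the minutes before an outage look like.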

Once the model is trained and tested against the test data, with the open cloud we can go even one step further. Researchers can collaborate with the operations team to validate their models against a live target. Operations can then adopt the model to enhance their operational excellence and finally involve software engineering. Ultimately, you want the model and the intelligence captured in code, right in the software that is being deployed—the software that will be deployed in another datacenter, in another incarnation of a cloud. That way, it will improve the operational excellence of all the clouds. This brings us closer to a world where operations of a cloud can be shared and can be installed, since it’s embedded in the software itself. To get there, we need that feedback cycle and an open source community that involves all three parties—operations, engineering, and research—and we need a living environment to iterate upon.

Sounds like a story from the future? The process has already begun. Red Hat is working with an evolving open cloud community at the Massachusetts Open Cloud to help define an architecture of an open cloud environment where operability is paramount and data-driven tools can play a key role. All discussions happen in public meetings and, even better, are tracked in a Git repository, so we can involve all parties early in the process and trace back how we came to a certain decision. That’s key, since the decision process is as important as the final outcome. All operational data will be accessible, and it will be easy to run a workload there and to get access to backend data.

If you’re interested in collaborating, join us at openinfralabs.org.
