Engineers on the Mass Open Cloud are continually developing new capabilities for the research resource. Here’s how.
For many people, “the cloud” is a very abstract entity. It’s a place to store their photos and data, and they don’t expect to have control over it beyond setting a password for access. In the world of academia and research, however, the cloud is not just a digital archive; it’s a fundamental environment for getting work done. The Mass Open Cloud (MOC), which I work on, is a powerful example. The MOC gives researchers in computing, healthcare, science, and other fields access to compute resources that were previously out of reach. Collaborating with students, faculty, and MOC staff, I’ve seen the needs and challenges they face and worked directly with them to develop new solutions—often solutions that give us better tools and insights to benefit all users.
One thing I’ve observed working with these groups is that while it is easy to develop locally on a laptop, it’s not always sufficient. It’s hard to test scaling, projects may not have enough resources (e.g., GPUs, CPUs, or memory), and dependency issues can arise when people run different operating systems. Developing in the final staging environment rather than on a laptop also makes the transition from development to production easier. By using the MOC’s Red Hat OpenShift environment for container orchestration and its OpenStack environment for virtual machines, users can do a lot more at a fraction of the cost of other infrastructure providers. Students and researchers can run compute-intensive workloads, deploy applications to a production environment, and collaborate seamlessly not only with their fellow students but also with Red Hat engineers sharing real-world expertise—all without needing to own expensive hardware or match complex software and driver requirements.
Currently, the MOC provides access to FC430 and FC830 servers for CPU workloads and to A100, V100, and H100 GPUs. These machines provide substantial compute power and are used to build the production OpenShift and OpenStack environments. They are also available to lease as bare metal machines on which users can install their own operating systems. This is invaluable not just for researchers but also for industry engineers. For example, Red Hat’s Emerging Technology (ET) team has used it for distributed model training development and other AI initiatives. The MOC also provides users with preconfigured telemetry that gives helpful insight into what is happening at the hardware level, such as performance, usage, and system health. To maintain this environment, ensure adherence to best practices, and—most important—continue upgrading to stay relevant and useful, a lot of work goes on behind the cloud.
Meeting diverse research requirements
When I started as an intern at Red Hat Research, I was assigned to work on Operate First, which, according to its GitHub page, was focused on “open sourcing operations on community-managed clusters.” Its goal was to create an environment for engineers to develop and deploy applications. Sounds awfully familiar: from my internship to coming back to Red Hat full time, there was a natural progression from a project creating community-managed clusters to working on the MOC, a collection of managed clusters for research.
As an engineer with Red Hat Research, part of my job is to explore new capabilities for the MOC. This includes efforts to improve the process of deploying new clusters, creating templates, writing runbooks for processes, and testing the use of HyperShift (hosted control planes) to lower resource usage when deploying multiple clusters. This work is pivotal for many users, since their development work cannot be done in a large shared cluster, for example, because of access-level or specific network configuration requirements. I’m often tasked with getting new use cases working in the current environment. For instance, we were recently asked by the Red Hat OpenShift AI business unit to integrate an MOC cluster into the vLLM CI pipeline as an environment where we could deploy machine learning workloads for developers. I was able to get it running even though the users relied on a platform that had not previously been deployed or tested on the MOC.
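To give a sense of why hosted control planes help: each additional cluster’s control plane runs as ordinary pods on an existing management cluster rather than occupying dedicated control-plane nodes. The sketch below shows roughly what requesting such a cluster looks like. The names, namespace, release image, and platform details are illustrative assumptions rather than our actual configuration, and exact fields vary across HyperShift versions and platforms.

```yaml
# Illustrative HostedCluster manifest (hypershift.openshift.io API).
# All names, the namespace, the release image, and the platform choice are
# placeholders for this sketch, not the MOC's real configuration.
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: research-team-a            # hypothetical tenant cluster
  namespace: clusters
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64  # example release
  pullSecret:
    name: research-team-a-pull-secret   # Secret created ahead of time
  sshKey:
    name: research-team-a-ssh-key
  platform:
    type: Agent                    # bare metal via the agent platform, as one example
    agent:
      agentNamespace: research-team-a-agents
  networking:
    clusterNetwork:
      - cidr: 10.132.0.0/14
    serviceNetwork:
      - cidr: 172.31.0.0/16
```

Because the control plane lives on the management cluster, the leased bare metal only has to supply worker capacity, which is what makes spinning up several small, purpose-built clusters affordable.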
When I joined the MOC team, the environment had three clusters: Production, where most users run their workloads in dedicated namespaces with specific resource allocations; Infra, a Red Hat Advanced Cluster Management (ACM) hub that manages the other clusters; and Test, an environment for testing upgrades and new operators before adding them to Production. Over time, we have added an observability cluster, which aggregates metrics from all managed clusters and displays them in Grafana dashboards. The observability cluster, coupled with fine-grained access control, lets users examine bare metal metrics, information that is often integral to their development.
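As a simplified illustration of the pattern (not our exact setup), exposing that aggregated metrics endpoint to Grafana amounts to a provisioned datasource; the datasource name and query URL below are placeholders I am using for the sketch.

```yaml
# Grafana datasource provisioning file (sketch).
# The name and query endpoint are placeholders; the real aggregation
# endpoint and the access controls in front of it on the MOC differ.
apiVersion: 1
datasources:
  - name: moc-aggregated-metrics                 # hypothetical datasource name
    type: prometheus
    access: proxy
    url: https://metrics-query.example.internal  # placeholder aggregated query endpoint
    isDefault: true
    jsonData:
      timeInterval: 30s                          # query resolution hint
```

The fine-grained access control then determines which dashboards and which clusters’ metrics a given user can actually see.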
We also established several bespoke test clusters, providing crucial environments for groups whose development work demands full admin privileges or specific network configurations. This summer we built an academic cluster for classes taught using the Open Education Project (OPE). This cluster is upgraded less frequently so as not to interrupt classes. These additions represent a substantial enhancement to the MOC’s capabilities. They also provide a good example of how we work to address diverse user needs while enriching the research and development ecosystem.
Apart from the bespoke clusters, where users are given full admin access, all other clusters are managed by OpenShift GitOps running on the Infra cluster. This lets the repo on GitHub serve as the main source of truth. Outside of very minor testing on the Test cluster, all changes to resources happen by creating or amending a YAML manifest in the OCP-on-NERC GitHub repo. When I first started, this was a daunting repo to look at; however, as I gained a better understanding of OpenShift, its structure made much more sense. Not only does it allow us to track changes, it also makes the clusters themselves reproducible: with the correct infrastructure and Secrets in place, applying the Kustomize overlay for a specific cluster to a fresh OpenShift install should re-create that cluster. This also meant that the post-install configuration of new MOC clusters could be templated, speeding up the process of deploying new clusters. Building templates was one of the first issues I worked on, and it resulted in this cluster-templating repo, which contains several Ansible files that create an overlay for a new cluster when provided with the correct variables. Applying the generated Kustomize file installs all the common operators and configurations shared across MOC clusters.
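To give a feel for that structure, here is a simplified sketch of what a per-cluster overlay can look like; the paths, cluster name, patch file, and label are hypothetical stand-ins, not the actual layout of the OCP-on-NERC repo.

```yaml
# overlays/example-cluster/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Pull in the base shared by every MOC cluster: common operators,
# RBAC, and configuration (hypothetical path).
resources:
  - ../../base

# Cluster-specific tweaks layered on top of the base (hypothetical patch file).
patches:
  - path: resource-quotas.yaml
    target:
      kind: ResourceQuota

# Label everything so it is easy to see which cluster owns a resource.
commonLabels:
  moc.example.org/cluster: example-cluster
```

Running kustomize build overlays/example-cluster, or pointing an OpenShift GitOps application at that path, then renders the full set of manifests for that cluster.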

Collaborative engineering to empower more users
When I was in college, I wasn’t just a computer science major; I was also a soccer captain, which has proven surprisingly useful. In my current role, I’ve had the opportunity to work with MOC staff, students and teachers, and various groups within Red Hat, and each group has presented a unique set of challenges. They have different requirements, from specific software versions to unique network setups, which can make a single solution for everyone impossible. Furthermore, deploying applications to OpenShift often requires extensive debugging to resolve issues that arise.
Fortunately, I find these challenges exciting. Much like coordinating a team on the field, addressing the challenges of engineering for the MOC requires constant communication, collaboration, and a shared effort to overcome issues and achieve a working solution.








