Engineers on the Mass Open Cloud are continually developing new capabilities for the research resource. Here’s how.
For many people, “the cloud” is a very abstract entity. It’s a place to store their photos and data, and they don’t expect to have control over it beyond setting a password for access. In the world of academia and research, however, the cloud is not just a digital archive; it’s a fundamental environment for getting work done. The Mass Open Cloud (MOC), which I work on, is a powerful example. The MOC gives researchers in computing, healthcare, science, and other fields access to compute resources that were previously out of reach. Collaborating with students, faculty, and MOC staff, I’ve seen the needs and challenges they face and worked directly with them to develop new solutions—often solutions that give us better tools and insights to benefit all users.
One thing I’ve observed working with these groups is that while it is easy to develop locally on a laptop, it’s not always sufficient. It’s hard to test scaling, projects may not have enough resources (e.g., GPUs, CPUs, or memory), and dependency issues can arise when people run different operating systems. Developing in the final staging environment rather than on a laptop also makes the transition from development to production easier. By using the MOC’s Red Hat OpenShift environment for container orchestration and its OpenStack environment for virtual machines, users can do a lot more at a fraction of the cost of other infrastructure providers. Students and researchers can run compute-intensive workloads, deploy applications to a production environment, and collaborate seamlessly not only with their fellow students but also with Red Hat engineers sharing real-world expertise—all without needing to own expensive hardware or match complex software and driver requirements.
Currently, the MOC provides access to FC430 and FC830 servers for CPU workloads and to A100, V100, and H100 GPUs. These machines provide substantial compute power and are used to build the production OpenShift and OpenStack environments. They are also available to lease as bare metal machines on which users can install their own operating systems. This is invaluable not just for researchers but also for industry engineers. For example, Red Hat’s Emerging Technology (ET) team has used it for distributed model training development and other AI initiatives. The MOC also provides users with preconfigured telemetry that gives helpful insight into what is happening at the hardware level, such as performance, usage, and system health. To maintain this environment, ensure adherence to best practices, and—most important—continue upgrading to stay relevant and useful, a lot of work goes on behind the cloud.
Meeting diverse research requirements
When I started as an intern at Red Hat Research, I was assigned to work on Operate First, which, according to its GitHub page, was focused on “open sourcing operations on community-managed clusters.” Its goal was to create an environment for engineers to develop and deploy applications. Sounds awfully familiar: from my internship to coming back to Red Hat full time, there was a natural progression from a project creating community-managed clusters to working on the MOC, a collection of managed clusters for research.
As an engineer with Red Hat Research, part of my job is to explore new capabilities for the MOC. This includes efforts to improve the process of deploying new clusters, creating templates, writing runbooks for processes, and testing the use of HyperShift (hosted control planes) to lower resource usage when deploying multiple clusters. This work is pivotal for many users, since their development work cannot be done in a large shared cluster, for example, because of access-level or specific network configuration requirements. I’m often tasked with getting new use cases working in the current environment. For instance, we were recently asked by the Red Hat OpenShift AI business unit to integrate an MOC cluster into the vLLM CI pipeline as an environment where we could deploy machine learning workloads for developers. I was able to get it running even though the users relied on a platform that had not previously been deployed or tested on the MOC.
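To give a sense of why hosted control planes help: each additional cluster’s control plane runs as ordinary pods on an existing management cluster rather than occupying dedicated control-plane nodes. The sketch below shows roughly what requesting such a cluster looks like. The names, namespace, release image, and platform details are illustrative assumptions rather than our actual configuration, and exact fields vary across HyperShift versions and platforms.

```yaml
# Illustrative HostedCluster manifest (hypershift.openshift.io API).
# All names, the namespace, the release image, and the platform choice are
# placeholders for this sketch, not the MOC's real configuration.
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: research-team-a            # hypothetical tenant cluster
  namespace: clusters
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64  # example release
  pullSecret:
    name: research-team-a-pull-secret   # Secret created ahead of time
  sshKey:
    name: research-team-a-ssh-key
  platform:
    type: Agent                    # bare metal via the agent platform, as one example
    agent:
      agentNamespace: research-team-a-agents
  networking:
    clusterNetwork:
      - cidr: 10.132.0.0/14
    serviceNetwork:
      - cidr: 172.31.0.0/16
```

Because the control plane lives on the management cluster, the leased bare metal only has to supply worker capacity, which is what makes spinning up several small, purpose-built clusters affordable.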
When I joined the MOC team, the environment had three clusters: Production, where most users run their workloads in dedicated namespaces with specific resource allocations; Infra, a Red Hat Advanced Cluster Management (ACM) hub that manages the other clusters; and Test, an environment for testing upgrades and new operators before adding them to Production. Over time, we have added an observability cluster, which aggregates metrics from all managed clusters and displays them in Grafana dashboards. The observability cluster, coupled with fine-grained access control, lets users examine bare metal metrics, information that is often integral to their development.
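As a simplified illustration of the pattern (not our exact setup), exposing that aggregated metrics endpoint to Grafana amounts to a provisioned datasource; the datasource name and query URL below are placeholders I am using for the sketch.

```yaml
# Grafana datasource provisioning file (sketch).
# The name and query endpoint are placeholders; the real aggregation
# endpoint and the access controls in front of it on the MOC differ.
apiVersion: 1
datasources:
  - name: moc-aggregated-metrics                 # hypothetical datasource name
    type: prometheus
    access: proxy
    url: https://metrics-query.example.internal  # placeholder aggregated query endpoint
    isDefault: true
    jsonData:
      timeInterval: 30s                          # query resolution hint
```

The fine-grained access control then determines which dashboards and which clusters’ metrics a given user can actually see.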
We also established several bespoke test clusters, providing crucial environments for groups whose development work demands full admin privileges or specific network configurations. This summer we built an academic cluster for classes taught using the Open Education Project (OPE). This cluster is upgraded less frequently so as not to interrupt classes. These additions represent a substantial enhancement to the MOC’s capabilities. They also provide a good example of how we work to address diverse user needs while enriching the research and development ecosystem.
Apart from the bespoke clusters, where users are given full admin access, all other clusters are managed by OpenShift GitOps running on the Infra cluster. This lets the repo on GitHub serve as the main source of truth. Outside of very minor testing on the Test cluster, all changes to resources happen by creating or amending a YAML manifest in the OCP-on-NERC GitHub repo. When I first started, this was a daunting repo to look at; however, as I gained a better understanding of OpenShift, its structure made much more sense. Not only does it allow us to track changes, it also makes the clusters themselves reproducible: with the correct infrastructure and Secrets in place, applying the Kustomize overlay for a specific cluster to a fresh OpenShift install should re-create that cluster. This also meant that the post-install configuration of new MOC clusters could be templated, speeding up the process of deploying new clusters. Building templates was one of the first issues I worked on, and it resulted in this cluster-templating repo, which contains several Ansible files that create an overlay for a new cluster when provided with the correct variables. Applying the generated Kustomize file installs all the common operators and configurations shared across MOC clusters.
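To give a feel for that structure, here is a simplified sketch of what a per-cluster overlay can look like; the paths, cluster name, patch file, and label are hypothetical stand-ins, not the actual layout of the OCP-on-NERC repo.

```yaml
# overlays/example-cluster/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Pull in the base shared by every MOC cluster: common operators,
# RBAC, and configuration (hypothetical path).
resources:
  - ../../base

# Cluster-specific tweaks layered on top of the base (hypothetical patch file).
patches:
  - path: resource-quotas.yaml
    target:
      kind: ResourceQuota

# Label everything so it is easy to see which cluster owns a resource.
commonLabels:
  moc.example.org/cluster: example-cluster
```

Running kustomize build overlays/example-cluster, or pointing an OpenShift GitOps application at that path, then renders the full set of manifests for that cluster.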

Collaborative engineering to empower more users
When I was in college, I wasn’t just a computer science major; I was also a soccer captain, which has proven surprisingly useful. In my current role, I’ve had the opportunity to work with MOC staff, students and teachers, and various groups within Red Hat, and each group has presented a unique set of challenges. They have different requirements, from specific software versions to unique network setups, which can make a single solution for everyone impossible. Furthermore, deploying applications to OpenShift often requires extensive debugging to resolve issues that arise.
Fortunately, I find these challenges exciting. Much like coordinating a team on the field, addressing the challenges of engineering for the MOC requires constant communication, collaboration, and a shared effort to overcome issues and achieve a working solution.








