For those active in the early years of cloud computing, the challenges of open AI systems may feel strangely familiar. Do large-scale research collaborations have a lesson for today’s AI developers and engineers? We think so.
With the proliferation of cloud computing in the early 2000s, IT organizations faced a new challenge: how to manage services efficiently across multiple infrastructures, especially when a single cloud application could span many services, making it very difficult to isolate the root cause of a problem. Proactive monitoring, alerting, and response for a single application might require analyzing statistics from a large regional database service, a national high-speed backbone, key regional and local caching infrastructure, and the local servers and laptops at multiple offices or campuses that actually initiated the application and provided the end user’s window into it. Each application had a different set of indicators used to determine whether it was operating correctly, so IT organizations customized their operations tools to correlate and communicate the state of those indicators to operators, whose goal was to find and fix problems before users reported them.
In effect, IT organizations were building models of how healthy applications behaved and distributing that knowledge across multiple organizations and infrastructures so that it could be used to solve user problems. If this sounds like an AI domain, you’re right. But at a time when IT teams had only just worked through the Year 2000 problem, making sure all of these services handled four-digit rather than two-digit dates, there were no practical options for applying ML techniques to observability, and there were many obstacles even to collecting the data those techniques require. Instead, engineers focused on developing home-grown tools that could help sort through the data flooding in from diverse sources, looking for connections to solve individual problems.
Fortunately, the research community had already seen into this future when construction began on the Large Hadron Collider (LHC) in 1998, bringing with it the challenges of mass data storage, processing, and transfer. The LHC project enabled a large international community of over 10,000 very smart, enthusiastic researchers and developers in 100 countries to combine forces to build a distributed data collection and analysis tool for LHC experiments, one that spanned many infrastructures we would later come to call clouds. Emblematic of the large-scale collaboration the challenge required, the LHC tunnel itself belongs to no one country: it spans the France-Switzerland border near Geneva. The Worldwide LHC Computing Grid services launched in 2003, and fast development, test, deployment, and operations cycles for these services became the standard long before the term DevOps became popular in industry.
Critically, a large-scale distributed collaboration like this required open source software to work. Open source software, in turn, hastened the transition of these research-driven approaches for observing and managing large-scale data and distributed systems to the few enterprises that had started building their own large distributed infrastructures to manage services rather than science experiments. Google and Amazon began presenting at more academic conferences and publishing papers about cloud management, but they did not release their tools as open source software. Splunk, founded in San Francisco in 2003 to provide a web-style interface to data collected and integrated in a central database, was an early market leader. Although Splunk was not open source, other tools like Kibana, Grafana, and Prometheus were evolving and began to be released widely in the 2010s.
Still, there was no agreed-upon definition of standard APIs and conventions for handling telemetry data consistently no matter what tool was being used to analyze it, so the barriers to large-scale collaboration between organizations whose applications spanned multiple commercial services remained. OpenTelemetry, which finally began to be defined in 2019 with seed funding and technical committees in the Cloud Native Computing Foundation (CNCF), provided an increasingly popular way to avoid vendor lock-in for telemetry data and to simplify the development of software to manage that data with a smaller set of APIs and conventions.
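To make that concrete, here is a minimal sketch of what vendor-neutral instrumentation looks like with the OpenTelemetry Python SDK. The service name, span name, and attributes are illustrative, and the console exporter stands in for whatever backend a team actually runs; the point is that changing backends means swapping an exporter, not rewriting the instrumentation.

```python
# Minimal OpenTelemetry tracing sketch (Python SDK).
# The service name, span name, and attributes below are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Describe the service emitting telemetry using standard resource attributes.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Export spans to the console here; pointing an OTLP exporter at any compliant
# backend instead requires no changes to the instrumentation further down.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rhrq.example")

# Instrument one unit of work and attach attributes for later correlation.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)
```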
So why the history lesson? Because AI has accelerated the pace of cloud development, but the problem of collecting, managing, and analyzing telemetry to support ever more complex cloud services remains. The challenge now is keeping not just the software that analyzes telemetry, but also the data itself and the results and recommendations that AI service management software generates, visible to those who build, manage, and use cloud services from multiple providers. That is an even harder and more active research area than anything we have faced before. Even defining what constitutes a truly open AI system is controversial: the Open Source Initiative has been working on a definition of open source AI for years, and as this issue goes to print, we’re looking at an October 2024 announcement.
The ways in which infrastructure, data, and AI systems from different owners can be combined or federated into large-scale services have continued to proliferate. AI developers, network engineers, application developers, and university researchers are currently working together in organizations like the AI Alliance and Horizon Europe on prototypes that are meant to keep telemetry open and manageable by those who use the cloud, as well as those who create and run the commercial services they depend on.
AI telemetry solutions can help manage the data flood, find connections, and make better recommendations to users and service owners, but they need a feedback loop that is open and accessible, giving collaborators a chance to understand and influence decisions that affect them and their data. Red Hat and our research partners are actively collaborating on early prototypes for this work. As one example, expect to hear more in the coming months about the Co-ops project, a novel framework for collaboratively developing and training AI models at scale that moves beyond the limitations of traditional federated learning. We’re in the process of building multiple large clusters in the Mass Open Cloud (MOC) to support AI in this and other work that requires importing data from distributed sources, training and updating models, and then distributing results worldwide. We’re also excited to see results from the SEMLA project (Securing Enterprises via Machine-Learning-based Automation), which is looking at ways to integrate LLMs into system development and network configuration.
After many years of encouraging collaboration among diverse groups of humans to build federated cloud and data management systems, we’re now bringing AI systems to the stakeholders’ table. Will we build a new kind of open DevOps between human developers and AI operators for telemetry? We don’t know yet, but we have some very smart humans and AI systems already colliding in a new kind of accelerator to find out.
In the quarterly
AI systems are table stakes in another major challenge: sustainable computing. This issue of RHRQ features an interview with John Goodhue, the director of the Massachusetts Green High-Performance Computing Center (MGHPCC), the datacenter supporting projects in the MOC Alliance. John spoke with Red Hat Principal Software Engineer Parul Singh, a leader in both open telemetry and sustainability projects, and Boston University postdoc Han Dong, a researcher focused on using AI/ML to dynamically balance performance and energy efficiency. They suggest that we are just at the beginning of making plans for sustainable energy use and understanding the complex variables involved in reducing the impact of computing on the environment, from climate change to power grids. You can’t manage what you can’t collect! Parul and Han also give us a peek into PEAKS—the Power Efficiency Aware Kubernetes Scheduler. PEAKS uses insights from Kepler, an open source tool for collecting resource utilization metrics, and machine learning algorithms to dynamically tune the Linux kernel in a way that optimizes energy efficiency while still meeting specified performance requirements.
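As a rough illustration of the “you can’t manage what you can’t collect” point, the sketch below ranks cluster nodes by the energy Kepler has attributed to their containers, assuming Kepler’s metrics are scraped by a Prometheus server. The Prometheus URL is hypothetical, and the metric and label names follow Kepler’s published conventions but should be verified against a real deployment; a scheduler like PEAKS would consume signals of this kind alongside performance requirements.

```python
# Sketch: rank Kubernetes nodes by recent energy use reported by Kepler.
# Assumes Kepler metrics are scraped by Prometheus at PROM_URL (hypothetical);
# check the metric and label names against your own deployment.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Energy (joules) attributed to containers over the last 5 minutes, per node.
QUERY = "sum by (instance) (increase(kepler_container_joules_total[5m]))"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
samples = resp.json()["data"]["result"]

# Sort nodes from least to most energy consumed; a power-aware scheduler could
# prefer the head of this list when performance requirements still allow it.
ranked = sorted(
    ((s["metric"].get("instance", "unknown"), float(s["value"][1])) for s in samples),
    key=lambda pair: pair[1],
)

for node, joules in ranked:
    print(f"{node}: {joules:.1f} J over the last 5 minutes")
```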
Meanwhile, we continue to scale the capabilities of the MOC for AI, in part because of the research and educational opportunities it can support. At Red Hat Research we have long worked to democratize access to educational opportunities through open source technology. Danni Shi gives us an update on the Open Education Project (OPE), headed by PI Jonathan Appavoo. OPE has moved to the MOC Alliance’s New England Research Cloud (NERC) and successfully supported classes for hundreds of students at Boston University. Danni describes the functional enhancements Red Hat engineers and BU faculty and students achieved in the past year to make OPE ready to host courses from universities around the world at an affordable price, and potential new users have already started to reach out.
As discussions about the possibilities and perils of AI take center stage in research and industry, these engineers and educators are already demonstrating how AI and large-scale systems can make a real-world impact, and what we need to measure to make sure they keep doing so across many different collaborations.