For those active in the early years of cloud computing, the challenges of open AI systems may feel strangely familiar. Do large-scale research collaborations have a lesson for today’s AI developers and engineers? We think so.
With the proliferation of cloud computing in the early 2000s, IT organizations faced a new challenge: how to manage services efficiently across multiple infrastructures, especially when a single cloud application could span many services, making it very difficult to isolate the root cause of a problem. Proactive monitoring, alerting, and response for a single application might require analyzing statistics from a large regional database service, a national high-speed backbone, key regional and local caching infrastructure, and the local servers and laptops at multiple offices or campuses that actually initiated the application and provided the end user’s window into it. Each application had a different set of indicators used to determine whether it was operating correctly, so IT organizations customized their operations tools to correlate and communicate the state of those indicators to operators, whose goal was to find and fix problems before users reported them.
In effect, IT organizations were building models of how healthy applications behaved and distributing that knowledge across multiple organizations and infrastructures so that it could be used to solve user problems. If this sounds like an AI domain, you’re right. But at a time when IT teams had only just worked through the Year 2000 problem, making sure all of these services handled four-digit rather than two-digit dates, there were no practical options for applying ML techniques to observability, and there were many obstacles even to collecting the data those techniques require. Instead, engineers focused on developing home-grown tools that could help sort through the data flooding in from diverse sources, looking for connections to solve individual problems.
Fortunately, the research community had already seen into this future when construction began on the Large Hadron Collider (LHC) in 1998, bringing with it the challenges of mass data storage, processing, and transfer. The LHC project enabled a large international community of over 10,000 very smart, enthusiastic researchers and developers in 100 countries to combine forces to build a distributed data collection and analysis tool for LHC experiments, one that spanned many infrastructures we would later come to call clouds. Emblematic of the large-scale collaboration the challenge required, the LHC tunnel itself belongs to no one country: it spans the France-Switzerland border near Geneva. The Worldwide LHC Computing Grid services launched in 2003, and fast development, test, deployment, and operations cycles for these services became the standard long before the term DevOps became popular in industry.
Critically, a large-scale distributed collaboration like this required open source software to work. Open source software, in turn, hastened the transition of these research-driven approaches for observing and managing large-scale data and distributed systems to the few enterprises that had started building their own large distributed infrastructures to manage services rather than science experiments. Google and Amazon began presenting at more academic conferences and publishing papers about cloud management, but they did not release their tools as open source software. Splunk, founded in San Francisco in 2003 to provide a web-style interface to data collected and integrated in a central database, was an early market leader. Although Splunk was not open source, other tools like Kibana, Grafana, and Prometheus were evolving and began to be released widely in the 2010s.
Still, there was no agreed-upon definition of standard APIs and conventions for handling telemetry data consistently no matter what tool was being used to analyze it, so the barriers to large-scale collaboration between organizations whose applications spanned multiple commercial services remained. OpenTelemetry, which finally began to be defined in 2019 with seed funding and technical committees in the Cloud Native Computing Foundation (CNCF), provided an increasingly popular way to avoid vendor lock-in for telemetry data and to simplify the development of software to manage that data with a smaller set of APIs and conventions.
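To make that concrete, here is a minimal sketch of what vendor-neutral instrumentation looks like with the OpenTelemetry Python SDK. The service name, span name, and attributes are illustrative, and the console exporter stands in for whatever backend a team actually runs; the point is that changing backends means swapping an exporter, not rewriting the instrumentation.

```python
# Minimal OpenTelemetry tracing sketch (Python SDK).
# The service name, span name, and attributes below are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Describe the service emitting telemetry using standard resource attributes.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Export spans to the console here; pointing an OTLP exporter at any compliant
# backend instead requires no changes to the instrumentation further down.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rhrq.example")

# Instrument one unit of work and attach attributes for later correlation.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)
```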
So why the history lesson? Because AI has accelerated the pace of cloud development, but the problem of collecting, managing, and analyzing telemetry to support ever more complex cloud services remains. The challenge now is keeping not just the software that analyzes telemetry, but also the data itself and the results and recommendations that AI service management software generates, visible to those who build, manage, and use cloud services from multiple providers. That is an even harder and more active research area than anything we have faced before. Even defining what constitutes a truly open AI system is controversial: the Open Source Initiative has been working on a definition of open source AI for years, and as this issue goes to print, we’re looking at an October 2024 announcement.
The ways in which infrastructure, data, and AI systems from different owners can be combined or federated into large-scale services have continued to proliferate. AI developers, network engineers, application developers, and university researchers are currently working together in organizations like the AI Alliance and Horizon Europe on prototypes that are meant to keep telemetry open and manageable by those who use the cloud, as well as those who create and run the commercial services they depend on.
AI telemetry solutions can help manage the data flood, find connections, and make better recommendations to users and service owners, but they need a feedback loop that is open and accessible, giving collaborators a chance to understand and influence decisions that affect them and their data. Red Hat and our research partners are actively collaborating on early prototypes for this work. As one example, expect to hear more in the coming months about the Co-ops project, a novel framework for collaboratively developing and training AI models at scale that moves beyond the limitations of traditional federated learning. We’re in the process of building multiple large clusters in the Mass Open Cloud (MOC) to support AI in this and other work that requires importing data from distributed sources, training and updating models, and then distributing results worldwide. We’re also excited to see results from the SEMLA project (Securing Enterprises via Machine-Learning-based Automation), which is looking at ways to integrate LLMs into system development and network configuration.
After many years of encouraging collaboration among diverse groups of humans to build federated cloud and data management systems, we’re now bringing AI systems to the stakeholders’ table. Will we build a new kind of open DevOps between human developers and AI operators for telemetry? We don’t know yet, but we have some very smart humans and AI systems already colliding in a new kind of accelerator to find out.
In the quarterly
AI systems are table stakes in another major challenge: sustainable computing. This issue of RHRQ features an interview with John Goodhue, the director of the Massachusetts Green High-Performance Computing Center (MGHPCC), the datacenter supporting projects in the MOC Alliance. John spoke with Red Hat Principal Software Engineer Parul Singh, a leader in both open telemetry and sustainability projects, and Boston University postdoc Han Dong, a researcher focused on using AI/ML to dynamically balance performance and energy efficiency. They suggest that we are just at the beginning of making plans for sustainable energy use and understanding the complex variables involved in reducing the impact of computing on the environment, from climate change to power grids. You can’t manage what you can’t collect! Parul and Han also give us a peek into PEAKS—the Power Efficiency Aware Kubernetes Scheduler. PEAKS uses insights from Kepler, an open source tool for collecting resource utilization metrics, and machine learning algorithms to dynamically tune the Linux kernel in a way that optimizes energy efficiency while still meeting specified performance requirements.
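As a rough illustration of the “you can’t manage what you can’t collect” point, the sketch below ranks cluster nodes by the energy Kepler has attributed to their containers, assuming Kepler’s metrics are scraped by a Prometheus server. The Prometheus URL is hypothetical, and the metric and label names follow Kepler’s published conventions but should be verified against a real deployment; a scheduler like PEAKS would consume signals of this kind alongside performance requirements.

```python
# Sketch: rank Kubernetes nodes by recent energy use reported by Kepler.
# Assumes Kepler metrics are scraped by Prometheus at PROM_URL (hypothetical);
# check the metric and label names against your own deployment.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

# Energy (joules) attributed to containers over the last 5 minutes, per node.
QUERY = "sum by (instance) (increase(kepler_container_joules_total[5m]))"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
samples = resp.json()["data"]["result"]

# Sort nodes from least to most energy consumed; a power-aware scheduler could
# prefer the head of this list when performance requirements still allow it.
ranked = sorted(
    ((s["metric"].get("instance", "unknown"), float(s["value"][1])) for s in samples),
    key=lambda pair: pair[1],
)

for node, joules in ranked:
    print(f"{node}: {joules:.1f} J over the last 5 minutes")
```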
Meanwhile, we continue to scale the capabilities of the MOC for AI, in part because of the research and educational opportunities it can support. At Red Hat Research we have long worked to democratize access to educational opportunities through open source technology. Danni Shi gives us an update on the Open Education Project (OPE), headed by PI Jonathan Appavoo. OPE has moved to the MOC Alliance’s New England Research Cloud (NERC) and successfully supported classes for hundreds of students at Boston University. Danni describes the functional enhancements Red Hat engineers and BU faculty and students achieved in the past year to make OPE ready to host courses from universities around the world at an affordable price, and potential new users have already started to reach out.
As discussions about the possibilities and perils of AI take center stage in research and industry, these engineers and educators are already demonstrating how AI and large-scale systems can make a real-world impact, and what we need to measure to make sure they keep doing so across many different collaborations.