Red Hat Research Quarterly

Having our cake and eating it too: HPC meets enterprise AI

About the author

Orran Krieger

Orran Krieger is the Director of Red Hat Research while on leave of absence from Boston University, where he is a professor in the Department of Electrical and Computer Engineering. He is a founding lead of the Mass Open Cloud Alliance (MOC-A).

The convergence of high-performance research computing and general-purpose IT is turning conventional wisdom on its head.

High-Performance Computing (HPC) and general-purpose IT infrastructure have always been very different. The difference is especially visible at top universities, which maintain both types of environments: HPC clusters dedicated to research, often managed by a small team of two or three staff overseeing thousands of machines, and enterprise IT environments, where perhaps 50 operations staff manage hundreds of computers running a diverse array of mission-critical workloads.

The differences go far beyond operational complexity. HPC environments are generally dedicated to a smaller number of large-scale workloads (say, computation for particle accelerator experiments) whose results will be widely shared, while enterprise environments have strong compliance and security requirements around many users. HPC applications may access massive datasets, but each application is specialized for data from a specific domain. General-purpose environments support many different types of users and applications, often combining data from diverse sources, ranging from file systems, distributed databases, and data warehouses to live data streaming from sensors. HPC applications rely on a limited set of libraries that remain fairly stable, while general-purpose environments see frequent updates, from security patches to constantly evolving services and libraries. HPC jobs are batch scheduled (e.g., managed using SLURM), with all the computers in the cluster potentially working on the same problem simultaneously, while general-purpose environments typically support loosely coupled independent tasks and interactive workloads orchestrated using platforms like Kubernetes.
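The scheduling contrast can be made concrete with a small sketch. Below is a minimal SLURM batch script of the kind an HPC cluster might accept; the job name, node counts, and executable are illustrative assumptions, not taken from any system described here:

```shell
#!/bin/bash
# Minimal SLURM batch script (config sketch): the scheduler queues
# the job and, when resources free up, allocates ALL requested nodes
# to it at once -- a single tightly coupled computation.
#SBATCH --job-name=particle-sim      # hypothetical job name
#SBATCH --nodes=64                   # cluster-scale allocation
#SBATCH --ntasks-per-node=8          # MPI ranks per node
#SBATCH --time=12:00:00              # batch jobs run against a wall-time limit

srun ./simulate --input events.dat   # hypothetical MPI-style executable
```

The Kubernetes analogue would be a Deployment or Job manifest applied with `kubectl apply`, where each pod is scheduled independently and rescheduled on failure; there is no notion of one job holding the whole cluster at once.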

HPC generally uses free software and on-premises experts for support, while general-purpose IT pays licensing fees and invests in multitiered processes with vendor partners to ensure 24×7 support and escalation. Finally, to handle computation failures, HPC work has focused on checkpointing, while general-purpose computing requires application state to be tracked and handled independently from compute node state in a cluster (see Pets vs. Cattle). 

The Mass Open Cloud (MOC), inspired by Project Kittyhawk (Jonathan Appavoo et al.), has, since its inception, been based on the radical hypothesis that these two worlds could converge into a common platform where both HPC and cloud computing benefit from a set of common services, business models, operational disciplines, and capabilities. The wide-scale adoption of AI is increasingly making this radical hypothesis established practice. For example:

  • Neoclouds like CoreWeave, Nebius AI, Lambda, Crusoe, and Voltage Park support both SLURM batch scheduling and Kubernetes dynamic container clusters.
  • The Department of Energy’s Genesis Mission is applying National Lab HPC supercomputers to AI challenges. 
  • Data at massive scale is needed to train and tune successful AI models, and much of that data requires security and data ownership protection different from traditional academic HPC.
  • The rapid pace of change in AI, and its broad use, means that systems are changing at a pace HPC systems never experienced before. 
  • Hardware and software developed and improved for general-purpose enterprise use, with strict compliance and security requirements, are increasingly critical for real-life AI applications.

The implications of this convergence, especially in an environment for academic research, are profound. AI startups are emerging from research universities at an accelerating pace, driving the bleeding edge of new platforms. Industry now closely tracks what researchers are doing, and many AI innovations move from universities into general use in weeks. Research universities are adopting AI long before enterprise customers, which means that university workloads, and the systems needed to support them, are important areas of interest for collaborative industry and computer systems research.

In February 2024, Governor Maura Healey formed the Massachusetts AI Strategic Task Force, which recommended the establishment of the Massachusetts AI Hub to serve as a nexus for AI innovation and facilitate cutting-edge collaboration between government, industry, academia, and nonprofits. Key interrelated initiatives of the AI Hub to unlock innovation in the commonwealth include creating the AI Compute Resource (AICR) infrastructure to supply compute capacity, a Data Commons to unlock the value of shared, high-quality, and responsibly governed data across various sectors, and programs to support and accelerate AI startups. 

In this issue, we feature a conversation between Stefanie Chiras, Senior Vice President of the AI Innovation Hub at Red Hat, and Chris Sedore, Vice President and CIO at Boston University, two leaders helping shape the Mass Open Cloud (MOC) around AI as a catalyst for research, innovation, and economic growth. The AI-driven convergence of HPC and general-purpose computing positions the MOC to play an important role in supporting initiatives like the Massachusetts AI Hub. The MOC is evolving its cloud-native services to support research, education, and startup communities with the security, compliance, and operational rigor modern AI workloads demand.

Under Chris’s leadership, Boston University’s enterprise IT organization will assume operational responsibility for these services early next year, working closely with Red Hat to rapidly enable the compliance and security regimes required for AI use cases, particularly in health care and other data-intensive domains. Stefanie, who leads Red Hat’s partnership with the AI Hub, is engaging AI startups to take advantage of this infrastructure. This effort is closely aligned with the Mass Data Commons, where strong security controls and enterprise-grade services are essential for governing access to sensitive data used in AI workloads.

Together, Chris and Stefanie are helping galvanize a broad set of industry and research partnerships around the MOC. In 2026, this momentum will continue with the launch of i-Scale, an NSF Industry–University Research Center, with founding partners that include Red Hat, SHI, Pure Storage, Lenovo, Cisco, and G Research. Read this conversation for a deeper look at how these leaders see the convergence of AI, infrastructure, and partnership shaping what comes next.

Also in this issue, learn how a partnership between Red Hat engineers and researchers at the Complutense University of Madrid is streamlining data processing and visualization for astronomers, who work with massive, distributed datasets (“Concurrent, scalable, and distributed astronomy processing in the AC3 framework”). 

Members of that collaboration also worked with a different EU-based team of engineers and researchers to develop an intelligent multicluster scheduler that automatically handles dependent Kubernetes resources and ensures network connectivity between distributed services (“Building an intelligent multicluster scheduler with network link abilities”). Continuing the theme of opening black boxes, researchers at the Brno University of Technology, a long-time Red Hat Research partner, are developing a model for an accurate, traceable AI Bill of Materials (AIBOM), usable not just for compliance but also for security analysis (“Unpacking AI’s black box: why authenticity and traceability must be built in”).

Finally, I’m excited to point readers to a follow-up to US Research Director Heidi Dempsey’s “From the Director” column in the previous issue of RHRQ, which introduced the National AI Infrastructure Research Resource (NAIRR) Pilot Program. In this issue, Heidi and AI Alliance contributor Peter Santhanam announce eight advanced AI research projects to be supported collaboratively by Red Hat, IBM Research, and the Mass Open Cloud (“Why open source is integral to US AI research infrastructure”). 

Providing computing resources and open source AI assets to NAIRR Pilot participants gives us another opportunity to advance computing for widespread public benefit.
