Red Hat Research Quarterly

Pushing the boundaries of AI development


About the author

Heidi Dempsey

Heidi Picher Dempsey is the US Research Director for Red Hat. She seeks and cultivates research and open source projects with academic and commercial partners in operating systems, hybrid clouds, performance optimization, networking, security, and distributed system operations.

Featured in Red Hat Research Quarterly, Spring 2026

A shared national AI research infrastructure may be coming to a galaxy not so far away.

Human time scales are slow—really slow. In the time it takes to type that sentence, one of the H100 GPUs powering a nearby academic datacenter has roughly 10 billion cycles to consider its place in the universe. Of course, it isn’t actually doing that; it’s just waiting for me to ask it to do “something, anything, please!-now-I-am-so-bored-I-am-going-to-sleep….” When I designed hardware in the distant past, I imagined my board’s CPU drumming fingers impatiently on its silicon desk, one finger each cycle, whenever the CPU was idle. Imagine the din of silicon clicking we’d hear in modern AI datacenters if that were the case!
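
A quick back-of-the-envelope check, for the skeptics, in a few lines of Python. The clock rate and typing time here are my assumptions, not measurements:

# Rough check of the "10 billion cycles" claim above.
# Assumptions (mine): an H100 boost clock near 1.8 GHz and about
# six seconds for a human to type one sentence.
GPU_CLOCK_HZ = 1.8e9    # assumed H100 boost clock, in cycles per second
TYPING_SECONDS = 6.0    # assumed typing time for the opening sentence

idle_cycles = GPU_CLOCK_HZ * TYPING_SECONDS
print(f"Cycles while one sentence is typed: {idle_cycles:.2e}")
# Prints ~1.08e+10: on the order of 10 billion taps on that silicon desk.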

When it comes to AI development, we are all already this astronaut. (Photo of Ryan Gosling: Raph_PH, CC BY 2.0)

Why am I thinking about this, other than the fact that I am a systems nerd? It’s because of the National Science Foundation and space travel. To be more specific, it’s because I was participating in the second NSF National AI Research Resource (NAIRR) annual meeting and reading Project Hail Mary, a space travel science fiction novel, on the plane. In the book (now a film with Ryan Gosling playing the lead), a science-teacher-turned-astronaut travels 11.9 light years from Earth in roughly 13 years (from Earth’s perspective) to save the planet. To do this (not a spoiler, don’t worry), he has to send information back to Earth, but holding any type of interactive conversation over those distances is impossible in a universe where the speed of light limits how fast information can travel. From a GPU’s perspective, its clock speed is what limits how fast information can move. Conveying information to a human who is, in terms of clock ticks, a galaxy away is glacially slow. Given that this interactive use case for AI is by far the most popular one at this stage of AI development, we are all already that astronaut on a faraway world.

Building systems to run AI models and applications takes much longer than using them (from a human’s perspective). The first year of the NAIRR pilot program focused mostly on creating joint industry-academic-government collaborations to foster shared infrastructure for research and development of these systems, as well as on experimenting with applications in science and engineering that could benefit from machine learning. I say ML instead of LLM because the NAIRR program supports a diverse set of models driven by science needs, not all of which are LLMs. General-purpose CPUs, the lingua franca of computing for decades, were not optimized for ML models. To be honest, neither were GPUs, but their design was closer to what is needed to (relatively) quickly train or fine-tune models at massive scale. So each NAIRR pilot scrambled for GPUs to build out its infrastructure. GPU vendors had a distinct advantage at this stage, which is why their participation in NAIRR was so critical.

But wait—hadn’t the large national labs already built supercomputers that could be used for ML models? Yes, but because open source software development for AI was overwhelmingly driven by open source cloud computing in the commercial world, not by HPC supercomputer architectures, much of the available open source ML software ran in Kubernetes clusters. On top of that, GPU manufacturers began to architect switches and high-speed interconnects to mesh GPUs together, developing special (and sometimes partially open) software to manage and configure this critical part of large-scale compute systems. Some HPC supercomputers thus started to add GPU clusters to their architecture, as did some academic Kubernetes research clouds, and the NAIRR pilots advanced. From a national research point of view, NAIRR resulted in several different types of pilots instead of a single centralized US design. This was beneficial, even though it made coordinating the multiple pilots and their research much more challenging for NAIRR participants and the NSF.


Hardware development for commercial AI systems concentrated on an extremely small number of vendors of ML-optimized compute units, compared with the many vendors who made general-purpose CPUs. Similarly, most software development relied on NVIDIA’s CUDA software. Is this an existential threat for open source development in the ML world? We don’t know yet, but with multiple different pilot architectures, the NAIRR program provides meaningful support for keeping AI development and systems optimization options open for multiple vendors. If this sounds familiar to those of you who’ve been in the open source ecosystem for a while, your déjà vu is justified. The open source advantages of making code, computing, and data handling portable so that they can run anywhere should be a long-term goal for AI development as well. Several researchers who spoke at the recent NAIRR annual meeting recognized this and emphasized the importance of open source in advancing their fields.

Early on, the NAIRR program recognized that different sciences would need different things from ML for their domain applications, and that many domains would also need to meet stricter data privacy and security requirements (for example, HIPAA regulations for the medical sciences). Accordingly, pilots were organized into four focus areas: research, security, data and models, and classroom use (see the NAIRR website for descriptions of these areas). With about $100 million in private sector in-kind contributions, as well as 14 federal agency partners, the NAIRR program has thus far resulted in over 600 research and education projects, support for over 6,000 students, substantial progress in pilots for privacy- and security-preserving infrastructure, and a clearinghouse for open data, models, and AI experimentation resources; see the NAIRR pilot resources and the NSF two-year progress report for more detail.

The Deep Partnership panel at the second annual NSF National AI Research Resource (NAIRR) meeting

In 2026, the NSF is preparing to transition NAIRR from pilots to long-term sustainable national AI assets for research and education. The design challenges in each of the four focus areas remain significant, but the results thus far have been very exciting. We have learned a lot from the Red Hat NAIRR deep partnerships, and we’ve explored new questions with experts from all four NAIRR focus areas. 

Gene Yao of UC San Diego and the Sanford Laboratories for Innovative Medicines shared work on mRNA therapeutics research that can change lives through the discovery and development of treatments for genetic disorders. Our infrastructure work pushes the boundaries of systems and data for clouds in the NAIRR context, and the goal of supporting diverse architectures for AI has us working on many critical open infrastructure questions. Can we enable federated learning with appropriate data protection and privacy between computing infrastructures as a routine function? Will it be feasible to pursue development in one compute/data environment and then apply it to a project in another environment? What type of peering and data exchange will be allowed for this type of functionality, and how do we evolve structures to capture and present those requirements in a safe exchange?
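
The first of those questions is concrete enough to sketch. Below is a minimal Python illustration of the federated averaging idea, with invented site names and toy data: each site trains on data that never leaves its walls and shares only model weights. A real cross-infrastructure version would also need the secure aggregation, peering agreements, and data-exchange policies those questions describe.

import numpy as np

# Minimal sketch of federated averaging (FedAvg): each site trains on its
# own private data and shares only model weights, never raw records.
# Site names, data, and model are invented for illustration.

def local_update(weights, X, y, lr=0.01, epochs=5):
    """One site's local training: gradient descent on a linear model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
sites = {name: (rng.normal(size=(100, 4)), rng.normal(size=100))
         for name in ["campus_cloud", "hpc_center", "hospital_enclave"]}

global_w = np.zeros(4)
for _ in range(10):                          # ten federation rounds
    # Each site trains locally; only updated weights leave the site.
    local_ws = [local_update(global_w, X, y) for X, y in sites.values()]
    global_w = np.mean(local_ws, axis=0)     # coordinator averages updates

print("Federated model weights:", global_w)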

Can we develop a language that gives an application more information about relevant characteristics of different peered services (e.g., whether a service configures its GPUs to sleep when idle, thus conserving energy but making workload ramp-ups slower)? How would this work in a system where federated learning allowed an application to mix HPC and cloud services from different providers with different advantages, according to a user’s stated preference? Would users actually want to communicate their preferences (run fast vs. save energy) if this were possible?
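
No such language exists yet, so the sketch below is purely hypothetical: a few lines of Python suggesting the shape a service-characteristics descriptor might take. Every field name (gpu_idle_sleep, ramp_up_s, joules_per_job) and every catalog entry is invented for illustration, not drawn from any existing NAIRR or cloud API.

from dataclasses import dataclass

# Hypothetical descriptor for a peered service; all fields are invented.
@dataclass
class PeeredService:
    name: str
    gpu_idle_sleep: bool   # GPUs sleep when idle: saves energy, slows ramp-up
    ramp_up_s: float       # seconds from request to first useful cycle
    joules_per_job: float  # rough energy cost per reference workload

def choose(services, preference):
    """Pick a service according to a user's stated preference."""
    if preference == "run_fast":
        return min(services, key=lambda s: s.ramp_up_s)
    if preference == "save_energy":
        return min(services, key=lambda s: s.joules_per_job)
    raise ValueError(f"unknown preference: {preference}")

catalog = [
    PeeredService("hpc_center", gpu_idle_sleep=False,
                  ramp_up_s=2.0, joules_per_job=900.0),
    PeeredService("green_cloud", gpu_idle_sleep=True,
                  ramp_up_s=45.0, joules_per_job=300.0),
]

print(choose(catalog, "run_fast").name)     # -> hpc_center
print(choose(catalog, "save_energy").name)  # -> green_cloud

Even this toy version exposes the hard part: for a “save energy” preference to mean anything, providers with very different architectures would have to measure and honestly report something like joules_per_job.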
How do we design and deploy these environments while making them easier to discover and less energy-hungry, so we protect the long-term health of our planet? If you close your eyes, you will see as many bright queries as there are stars in the sky. Our journey of discovery has light years to go.