The meaning of open source matters for AI. Our roundtable of experts discusses why, how, and for whom.
There is general agreement in the open source community that open source is crucial for AI development, both to accelerate innovation and to make it safer and more accessible. At the same time, there is only limited agreement on what open source in AI looks like. Early this year, Red Hat CTO Chris Wright smartly sidestepped the big definition conundrum when offering Red Hat’s perspective, explaining that Red Hat currently views the minimum criteria for calling AI open source to be open source-licensed model weights combined with open source software components, while keeping the conversation open.
Discussing the question of what we can reasonably call open source AI has been on the RHRQ wish list for well over a year, but all things AI have moved so quickly that every new approach was old news before we could reach the end of a publication cycle. A conversation with Kimberly Craven, Senior Director, Red Hat Open Source Program Office, convinced me that the definition problem wasn’t too big for a good conversation—it was too small.
Instead, we opened the floor to multiple perspectives—industry engineers, university researchers, community architects—to look at the challenges and benefits of putting an open source modifier next to AI in the real world. One of the critical themes that emerged is foregrounding who benefits from open source in AI and what they need most. If open source is fundamentally about empowerment, we have to understand what empowering people to use, develop, and share AI tools means in the real world. Do you think we’re missing an important perspective? Let us know! —Shaun Strohmer, Ed.

“Instead of fixating on whether AI can fully conform to the traditional definition of open source, we should channel the spirit of open collaboration to guide its development.”
— Kimberly Craven
Senior Director, Red Hat Open Source Program Office (OSPO)
Open source has served as a catalyst for decades, enabling rapid innovation and widely adopted standards through community collaboration. We’ve seen its impact across operating systems, integration technologies, application deployment, and more. It can have a similar transformative effect in the field of artificial intelligence—if we approach it thoughtfully. While today’s AI systems owe much of their existence to open source, that doesn’t mean every aspect of AI neatly fits within the traditional definition. Open source software is code that is designed to be publicly accessible—anyone can see, modify, and distribute it. While this principle can apply to parts of AI systems, it doesn’t extend to all of them.
For example, when AI components are clearly software—like inference tools or frameworks such as PyTorch and TensorFlow—it’s straightforward to license them openly, allowing others to view, adapt, and redistribute the code. But pre-trained models and the datasets they’re trained on present challenges. These components lack the same level of transparency and aren’t as easily encapsulated within code.
As an industry, we’re still grappling with what open source truly means in the context of AI. Instead of fixating on whether AI can fully conform to the traditional definition of open source, we should channel the spirit of open collaboration to guide its development. This means creating frameworks that prioritize transparency, enable meaningful contribution, and ensure that the benefits of AI are accessible to all.
Open source principles—transparency, collaboration, and shared ownership—have consistently driven innovation and trust in technology, and they are just as essential in shaping AI systems that are accountable, adaptable, and broadly beneficial. By applying these values wherever possible, we can help ensure that AI evolves in ways that reflect the needs and values of the wider community.

“Subsidiarity should be the guiding principle for open source AI.”
— Walter J. Scheirer
Dennis O. Doughty Collegiate Professor of Engineering, University of Notre Dame
There is much hand-wringing over the competing definitions of open source AI these days. We find some stakeholders attempting to defend corporate policies (e.g., open weights but not open training data, which keeps part of the training process a trade secret) and others quibbling over licensing (e.g., GPL compliance) in support of a preferred choice. From a social benefit perspective, a definition that encourages multiple modes of openness, tailored to different communities of users who have different requirements for controlling an AI system, may have more value. This is encapsulated in the idea of subsidiarity, or the principle of social organization that prefers to give control of a system to the most localized community that is able to effectively make its own decisions.
We can adapt this idea to AI models by asking, “How much control is needed if they are released to the public?” For example, if the most localized community is a special interest group that wants access to a customized LLM virtual assistant to help them organize, they may not need the original training data to create their LLM from an available pre-trained model. In this case, an open architecture and open weights from a company that prefers to keep its data secret can still facilitate fine-tuning for the task at hand. If university researchers interested in studying the exact operation of an AI model are the community, they will need the training data, architecture, and pre-trained weights. Both of these scenarios are open source, in some sense, but with varying degrees of openness determined by community need. Stated concisely, we can define open source AI as openness facilitating subsidiarity, giving primary consideration to the communities who are users of the technology.

“A world where LLMs are transparent and reproducible provides space for innovative and profitable business models.”
— Erik Erlandson
AI and Data Science Lead, Red Hat Office of the CTO (OCTO)
Much of the industry discussion regarding how best to define open source AI has focused on the definition of open weights: LLMs whose parameters are published under a permissive open license such as Apache 2.0. Open weights are a powerful enabler for building AI into open source software, and Red Hat uses its role as an open source leader to promote the use of open-weight LLMs, such as Mistral and IBM’s Granite model series.
However, open model weights don’t fully capture the open values of transparency or reproducibility. Using an analogy from software: open source projects, as defined by the Open Source Initiative, provide individuals with the ability to build an open project from source code. The implicit guarantee is that the project build process is both transparent and reproducible. Relatively few LLM projects publish detailed information about model pre-training data and the code used to run the training. While there have been some attempts to codify transparency, such as Stanford’s Foundation Model Transparency Index, transparency and reproducibility are not yet common evaluation criteria for LLMs. Fully transparent and reproducible models can benefit the entire software industry by allowing everyone to audit training data and methodologies for critical alignment properties, such as bias, safety, and fairness.
A world where LLMs are transparent and reproducible provides space for innovative and profitable business models. Open source software gives us another analogy here. Red Hat founded its business on packaging RHEL, even though in principle customers could build RHEL from source themselves. Possessing the raw information necessary to build an LLM in a transparent and reproducible manner is different from having the compute resources, in-house expertise, and operational experience to do so, and there is room for business models to thrive in this gap.

“Open sourcing AI models, datasets, and pre-trained models is a critical aspect of successful collaboration.”
— Michal Rosen-Zvi
Director of Healthcare and Life Sciences at IBM Research, Chief Scientist, IBM-Cleveland Clinic Discovery Accelerator, Adjunct Professor of Computational Medicine at the Hebrew University
With the advancement of AI and foundation models like LLMs in particular, research into disease and healthcare is making significant strides. These technologies are unlocking the discovery of new disease mechanisms, novel therapeutic targets, and innovative drug molecules. The exploration space is immense, encompassing massive datasets, diverse data representations, and a wide array of transformer-based architectures to explore.
The promise lies in harnessing vast resources—genomic data, RNA-based lab tests, protein databases, drug-protein interactions, and protein-protein interactions collected across species from bacteria and viruses to humans—to train these powerful models. The architectural landscape is equally rich, including encoder-only models, encoder-decoder frameworks, and autoencoders, among others. Unlike text data, biomedical information spans expression levels and sequences with complex three-dimensional structures, each demanding unique representation strategies.
This monumental effort can only succeed through community collaboration, as no single group possesses all the necessary talent, data, and resources to navigate this complex landscape alone. Open sourcing AI models, datasets, and pre-trained models is a critical aspect of this collaboration. In an encouraging development, many corporations and research groups have already begun sharing their technologies and forming new alliances—such as the AI Alliance and its Drug Discovery Working Group—with the shared goal of accelerating progress.
Together, by building a foundation of open collaboration and shared innovation, we stand on the cusp of transforming our understanding of disease and revolutionizing the path to treatment.

“Open source communities are instrumental in advancing the ethics and transparency of AI development.”
— Cara Delia
Red Hat Senior Principal Community Architect, AI Team Lead
Without the contributions of open source, the technology landscape of today would look very different. Complacency is not open source: curiosity is. For open source AI to thrive, collaboration through sharing both knowledge and resources is vital, and the collective open source community is made up of curious problem solvers. True open source projects offer unrestricted access to their source code, allowing for complete transparency, modification, and redistribution. This openness means both accessing the code and contributing to its development, fostering a community-driven approach that is the hallmark of open source.
Open source is about more than just code—it’s about people sharing knowledge and best practices. That’s why open source communities are instrumental in advancing the ethics and transparency of AI development. They are a group that genuinely cares about making things better for themselves and everyone else. When open source contributors come together, they build something great. Community provides a chrysalis space to help identify and think harder about the long-term effects of AI use cases. Through collaboration and peer review, experts can ensure that AI systems are developed in a responsible and ethical manner, striving to mitigate hidden biases or unintended consequences.
To democratize access to cutting-edge technology, encourage collaboration among experts in the field, and enable the development of sophisticated solutions to address real-world problems, we have to empower communities. This has always been true, and it is no different when it comes to AI. Community is the key to making it possible for anyone to build innovative AI software anywhere.

“Major differences in the way AI models and software relate to their ingredients … demand a different approach to accessing the benefits of open source in each area.”
— Jason Brooks
Senior Manager, Community Architects and Infrastructure, Red Hat Open Source Program Office (OSPO)
In the realm of software development, open source has been such a powerful and positive force that it makes sense to apply open source principles in as many other disciplines as possible. AI is no exception. Compared to fields such as healthcare or hardware, where our ideas about open source can’t always be ported over trivially, AI seems like an almost perfect fit. It consists of ordinary software and software-generated models that we develop from digital source material and use much like we would any other software.
The challenge of developing open source AI is often framed as if, were only the base ingredients of an AI model freely available, we could access all the benefits of open source software. However, major differences in how AI models and software relate to their ingredients, and in what is required to bring those ingredients together into a usable product, demand a different approach to accessing the benefits of open source in each area. In addition, anyone with a laptop computer can download the source code for the Linux kernel, apply a patch that they or someone else has written to enact a particular change, compile that code, and boot into the modified kernel. A similar exercise, performed from the base training data of a typical LLM, would require an outlay of hardware resources and personnel accessible only to a very large organization.
From an open source software point of view, centering one’s efforts around pre-trained models may feel unsatisfying, like buying a car with the hood welded shut, as the old saying goes. Fortunately, there’s a great deal of study, experimentation, modification, and collaboration going on around freely downloadable, openly licensed models today, and it’s accessible to small organizations and individuals. For instance, while it may be infeasible to study a large language model by poring through the raw training data that produced it, you can audit a model’s behavior by running inference against it. Retraining an LLM from the ground up is possible for only the largest organizations, but small projects and individuals can and do successfully modify and redistribute the pre-trained models to access the benefits of AI.
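The kind of behavioral audit described above can be sketched as a simple black-box probe harness: run paired prompts that differ only in one sensitive detail through a model and measure how often the responses diverge. This is a minimal illustration, not a rigorous bias evaluation, and the `generate` function below is a hypothetical placeholder for a real inference call to a downloaded open-weight model.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for real model inference.

    In practice this would call a locally hosted open-weight LLM;
    here it simply echoes the prompt so the harness is runnable.
    """
    return f"response to: {prompt}"


# Prompt pairs that differ only in a gendered pronoun; a real audit
# would use a much larger, carefully constructed probe set.
PROBE_PAIRS = [
    ("The nurse said he", "The nurse said she"),
    ("The engineer said he", "The engineer said she"),
]


def audit(pairs, model=generate):
    """Return the fraction of prompt pairs whose responses diverge."""
    diverging = sum(1 for a, b in pairs if model(a) != model(b))
    return diverging / len(pairs)


if __name__ == "__main__":
    print(f"divergence rate: {audit(PROBE_PAIRS):.2f}")
```

Because the harness treats the model as an opaque function, it works equally well against any model you can query, which is precisely the point: inference-time auditing requires only access to a runnable model, not to its training data.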
