Red Hat Research Quarterly

Generative AI and large language models: how did we get here, where are we going, and what does it mean for open source?

About the authors

Sanjay Arora

Sanjay Arora leads the AI agenda for Red Hat Research and is mainly interested in the application of machine learning to low-level systems.

Richard Fontana

Richard Fontana is Senior Commercial Counsel at Red Hat and founder of Red Hat’s Technology and Open Source legal team. Before coming to Red Hat in 2008, Richard was counsel at the Software Freedom Law Center (SFLC) and served as one of the three principal authors of GPLv3. For several years, he was a board director for the Open Source Initiative and chaired its license review committee.

Article featured in

Red Hat Research Quarterly

August 2023

We may not have all the answers, but we’re homing in on the essential questions about the future of AI and machine learning.

If you’ve somehow managed to escape the last nine months of breathless headlines and wild speculation about ChatGPT and what it means for humanity, you are lucky indeed. It’s not as though machine learning, large language models (LLMs), and image generation are particularly new ideas, or even particularly revolutionary. However, the sudden availability of the ChatGPT “oracle” and the services competing with it has captured popular imagination on the scale of the moon landings, and not surprisingly. Even though an LLM is, at bottom, a statistical model fit to a big pile of data, it seems like a real intelligence. That makes it both interesting and scary for all kinds of reasons. When my retiree neighbors start asking me questions about AI, I think it indicates that a fundamental shift has happened.

There is no question that Generative AI—in short, a system that can generate text, images, or other media in response to prompts—is going to become both ubiquitous and required in academic research, in industry, and in teaching and learning. But its growing popularity raises important questions for software engineers and researchers, particularly those of us concerned with open source. How do large models come into being? Why now? Where are they likely to go next? Can a model be open source, freely modifiable, and redistributable by others? Does modifying an open source model, if such a thing exists, require the original data? All of it, or just some? If I use the output of a model in my code or my writing, is it still mine? If not, whose is it?

To answer these questions, we asked Red Hat Research’s AI leader Sanjay Arora to help us understand how we got here and where, exactly, “here” is. We then asked Red Hat Legal’s leading thinker on open source and licensing, Richard Fontana, to help us understand the relationship between the models behind Generative AI, open source licensing, and software development. Although they leave us with many unsettled questions, I believe they impart a solid understanding of what Generative AI really means for open source and IT, and what to look for in the future.

—Hugh Brock, Director, Red Hat Research

How we got here

There has been impressive recent progress in machine learning models that generate realistic, high-quality text and images, including models like the GPT family for text generation and Stable Diffusion for image generation. While the public implementations of models like GPT have attracted plenty of hype and attention, tracing the historical development of ideas that led to them is instructive for understanding both their shortcomings and possible future developments.

Neural networks rose in prominence in 2012 after the image classification model AlexNet, a convolutional network, beat state-of-the-art benchmarks by a large margin on a standardized dataset called ImageNet.¹,² This model was trained using a combination of backpropagation and gradient descent,³ both very old ideas. AlexNet was also one of the first implementations using a graphics processing unit (GPU) for speeding up forward and backward propagation. The main surprise here was that a multi-layered neural network could be trained by a first-order derivative-based local optimization technique like gradient descent, and that this could be done in a reasonable amount of time using GPUs.
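To make "first-order derivative-based local optimization" concrete, here is a minimal sketch of gradient descent fitting a straight line to four points. It is far simpler than training AlexNet: the model has two parameters instead of millions, and the gradients are written out by hand rather than computed by backpropagation. The data, learning rate, and step count are illustrative assumptions.

```python
# Fit y = w * x + b to a few points with plain gradient descent.
# The data, learning rate, and step count are illustrative assumptions.

def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # First-order update: take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The fitted values should land close to w = 2 and b = 1. In a multi-layered network the update rule is the same; backpropagation is the bookkeeping that computes the gradient for every weight in every layer.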

This event led to an explosion of activity in multi-layered neural networks. Significant advances made training more stable—new activation functions, variants of gradient descent, various normalization and regularization strategies, architectural choices like skip connections, and so on—and enabled scaling to larger datasets using GPUs. The focus of deep learning practitioners switched from feature selection to devising better neural network architectures that would allow efficient end-to-end learning. The central idea was that given a labeled dataset of (input, output) pairs, one had to devise an architecture that could (a) operate on the inputs, (b) produce outputs of the right format (image, text, discrete labels, real-valued vectors), and (c) have enough weights or capacity to learn the mapping efficiently. This model could then be trained to learn the task of mapping the input to the output if such a mapping existed—in other words, if the outputs could be predicted from the inputs.

Self-supervised learning

The need for high-quality labeled data became a major bottleneck. Labeling data is generally a slow, laborious process that needs domain experts. Labeling tumors on radiological scans, which is very time-consuming, is a representative example. An elegant and very effective idea was resurrected to sidestep this requirement: use one part of the input data to predict another part of the input data that was masked or omitted from the actual input to the model. This is called self-supervised learning.

Examples include:

Predicting which word occurs in a sentence based on neighboring words
Predicting the next token in a sentence
Splitting an image into a 3 x 3 grid and training a network to predict the relative ordering of any two patches from the image
Converting an image to grayscale and predicting the RGB image from the grayscale image
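The two text-based examples above can be sketched in a few lines. This is a toy illustration, assuming whitespace-separated words as tokens and a hypothetical "[MASK]" placeholder; real systems use learned tokenizers and typically mask only a random subset of positions.

```python
def next_token_pairs(tokens):
    """Each prefix of the sequence predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_word_pairs(tokens, mask="[MASK]"):
    """Hide one token at a time; the hidden token becomes the target."""
    pairs = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        pairs.append((masked, target))
    return pairs

tokens = "the cat sat on the mat".split()
for context, target in next_token_pairs(tokens)[:3]:
    print(context, "->", target)
print(masked_word_pairs(tokens)[2])
```

In both cases the "label" comes from the text itself, so any large body of unlabeled text yields training examples for free.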

These so-called auxiliary tasks don’t need additional labeled data. Instead, the input data can be cleverly used to define a prediction task that a neural network can be trained to perform. The reason this is useful is particular to neural networks. Neural networks can be thought of as iterative maps of vectors to vectors. Here, a vector refers to a collection of numbers that can be added element-wise and multiplied by numbers (e.g., three-dimensional coordinate vectors in elementary geometry). This means that an input, such as an image or text, can be mapped to a vector that is an intermediate step for calculating the output.

These intermediate vectors, also called representations or embeddings, have a very interesting property: similar inputs get mapped to nearby vectors. Here, similar is a subjective notion. For example, we consider two images similar if they represent the same object, even if the angles, lighting, or background are different. Nearby, on the other hand, is a very concrete mathematical concept: two vectors are nearby if their difference vector has a short length. So we now have an explicit mathematical way of representing the subjective notion of similarity. To take advantage of this for self-supervised learning, you collect a large dataset, define a self-supervised task, and train a neural network on the task. You can then use this neural network to map each input object to its vector representation, which serves as a compressed numerical representation of the more complex input object.
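As a small illustration of "nearby," the sketch below measures the length of the difference vector between hypothetical embeddings. The three-dimensional vectors and their names are invented for the example; a trained network would produce vectors with hundreds or thousands of dimensions.

```python
import math

def distance(u, v):
    """Length of the difference vector u - v (Euclidean distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings; the numbers are invented for illustration.
embeddings = {
    "cat_photo_indoor": [0.9, 0.1, 0.2],
    "cat_photo_outdoor": [0.8, 0.2, 0.25],
    "truck_photo": [0.1, 0.9, 0.7],
}

def nearest(name):
    """Return the other item whose embedding is closest to `name`."""
    others = [k for k in embeddings if k != name]
    return min(others, key=lambda k: distance(embeddings[name], embeddings[k]))

print(nearest("cat_photo_indoor"))
```

With these made-up numbers, the two cat photos sit close together and the truck photo sits far from both, which is the behavior a well-trained model is expected to show for genuinely similar inputs.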

Once a model has been trained on a self-supervised task, it has learned the underlying structure inherent in the inputs. For language, this includes the grammar and structure of sentences and paragraphs. For images, this structure is the joint probability distribution of pixels that describes realistic images. In other words, the representations learned by the model are now meaningful for other tasks. The most common way we harness these representations for specific tasks is through a process called fine-tuning. Fine-tuning involves taking the pretrained model (i.e., the model trained in a self-supervised way) and training it further on small supervised tasks. For example, a pretrained language model can be trained further on a small dataset of reviews and their sentiment—“positive” or “negative,” say—to create a sentiment analysis model. The powerful representations learned by the pretrained network make training on small supervised datasets very effective. We call this semi-supervised training.

Scaling up

As self-supervised training followed by fine-tuning demonstrated its effectiveness, another significant development occurred: the invention of a new architecture called the transformer. The overwhelming architectural choice for natural language (or sequential data, in general) used to be recurrent neural networks in their various avatars, like long short-term memory networks (LSTMs) or gated recurrent units (GRUs). They operated on the input tokens in a loop (i.e., sequentially), which is hard to parallelize. Transformers operate on all input tokens at once. To account for long-range dependencies and relationships between tokens, they use a mechanism called self-attention, which is a concrete way of scoring the relationship between any two tokens. Transformers impose fewer inductive biases on the input data but can be scaled to much larger datasets given a fixed time budget. In recent years, transformers have been extended to other data modalities, especially computer vision (i.e., to images).
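A minimal sketch of self-attention, assuming the common scaled dot-product form, shows how every token is scored against every other token. One simplification is made loudly: real transformers first map each token vector to separate query, key, and value vectors using learned weight matrices, and that step is omitted here. The token vectors are invented.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: the learned query/key/value projections of a real
    transformer are left out to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: attention-weighted average of all tokens.
        mixed = [sum(w * vec[j] for w, vec in zip(weights, vectors))
                 for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # invented token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights says how much one token draws on every other token. Because no row depends on the result of another, all rows can be computed at once, which is what makes the architecture easy to parallelize compared with a recurrent loop.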

Combining transformers with semi-supervised training made it possible to train models on vast amounts of data. Based on past models, some scaling hypotheses were posited that estimated the amount of data, amount of compute, and model sizes required to achieve a certain model quality, where “quality” means test loss, or how accurately a model predicts on a hold-out dataset not used during training.⁴ These efforts led to training large language models like the GPT series, LLaMA, and PaLM.
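The scaling hypotheses in the cited paper⁴ take a power-law form. As a rough sketch for model size alone, assuming data and compute are not the bottleneck, with $N$ the number of model parameters, $L$ the test loss, and $N_c$ and $\alpha_N$ constants fitted empirically:

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```

The paper reports analogous relations for dataset size and compute. The symbols here follow that paper's notation; the values of the constants depend on the experimental setup and are not reproduced here.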

Efficient fine-tuning

A few other innovations have been instrumental. The first is a technique called LoRA (low-rank adaptation),⁵ which addresses the challenge of fine-tuning large pretrained models. LoRA dramatically reduces the number of effective parameters that must be changed during fine-tuning. This makes it possible to fine-tune massive LLMs on small datasets and limited hardware in a reasonable amount of time.
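To see where the reduction comes from, consider a sketch of the low-rank idea: instead of updating a full d x k weight matrix W, LoRA keeps W frozen and trains two thin matrices B (d x r) and A (r x k) whose product is added to W. The matrix sizes and rank below are illustrative assumptions, not figures from any particular model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_param_counts(d, k, r):
    """Trainable parameters: full d x k update vs. rank-r factors B and A."""
    return d * k, r * (d + k)

# Illustrative sizes (assumptions): one 4096 x 4096 weight matrix, rank 8.
full, low_rank = lora_param_counts(4096, 4096, 8)
print(full, low_rank, full // low_rank)

# Tiny example of the adapted weight W + B @ A with d = k = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
B = [[0.5], [0.0], [-0.5]]                                # trainable, d x r
A = [[1.0, 2.0, 3.0]]                                     # trainable, r x k
delta = matmul(B, A)
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
print(W_adapted[0])
```

For the assumed sizes, the low-rank factors hold 65,536 trainable numbers against 16,777,216 for the full matrix, a 256-fold reduction for that one layer.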

Another major idea is instruction tuning. While one could take a pretrained network and fine-tune it on various tasks like question-answering or text summarization, this process would result in one fine-tuned network per task. Ideally, one would need just one network to perform all the tasks, with the task being passed as an additional input. For example, one could pass the task and the input together as “Summarize the following text: [text]” or “Answer the question: [question].” The downside of this approach is that the network can only perform tasks seen during training and is very sensitive to how the query is structured. Instruction tuning allows one to train a single network to perform multiple tasks by feeding the task description as text to the model.⁶ As models got larger and had the capacity to perform multiple tasks, instruction tuning led not only to a model performing tasks seen during training but also to one that could generalize to new tasks outside the training set.
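As a rough sketch of what "feeding the task description as text" looks like in a training set, the snippet below turns examples from two different tasks into plain (prompt, target) string pairs using the templates quoted above. The function name and the example texts are hypothetical.

```python
def format_example(instruction, text, answer):
    """Turn one task example into a (prompt, target) pair of plain strings."""
    prompt = f"{instruction}: {text}"
    return prompt, answer

# Invented examples covering two tasks.
examples = [
    format_example("Summarize the following text",
                   "The meeting ran long and covered budgets and hiring.",
                   "A long meeting about budgets and hiring."),
    format_example("Answer the question",
                   "What is the capital of France?",
                   "Paris"),
]
for prompt, target in examples:
    print(prompt, "=>", target)
```

Because every task is reduced to the same text-in, text-out shape, a single network can be trained on the mixed list rather than one network per task.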

Yet another development is reinforcement learning from human feedback (RLHF). All language models are trained to predict probabilistic outputs; in other words, the output is a probability distribution over all possible tokens. The probabilities are then sampled to generate concrete tokens. This means that while the probabilities are fixed for a given input, the sampling process can generate a different output each time. Some of these outputs are qualitatively better than others. For a given query or input, a network’s output is sampled several times to produce different outputs, which human testers then rank. Ideally, the network would produce outputs that are highly ranked. The sampled outputs and their human ranks are fed back to the network to encourage the network to produce highly ranked answers. This is done using the framework of reinforcement learning (on-policy, model-free methods like proximal policy optimization, for the experts), where the ranks act as rewards.
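The sampling step can be sketched as follows. The vocabulary and scores are invented stand-ins for a model's output, and the sketch covers only sampling, not the ranking or reinforcement learning stages.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, probs, rng):
    """Draw one token according to the model's output distribution."""
    return rng.choices(vocab, weights=probs, k=1)[0]

# Invented vocabulary and scores standing in for a model's output.
vocab = ["Paris", "London", "Rome", "banana"]
probs = softmax([3.0, 1.5, 1.0, -2.0])

rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [sample_token(vocab, probs, rng) for _ in range(10)]
print([round(p, 3) for p in probs])
print(samples)
```

The probabilities are the same on every call, but the drawn tokens vary from draw to draw. It is these differing samples that human testers would rank in RLHF.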

Implications for industry

As generative models permeate industry and are adopted by professionals, their computational requirements and footprint will only rise. Even if training from scratch is limited to a few large companies, every institution using these models will have to invest in infrastructure for fine-tuning and inference. This infrastructure might be a third-party cloud service or on-premises hardware, but either will require a significant investment of time and resources.

There’s interest within most if not all large companies in using these models, especially LLMs. While a lot can be done through simple fine-tuning, robust application of LLMs in an industrial setting requires evaluating and implementing a lot of tools to ensure correctness and verify produced results. General education and guidelines for the appropriate use of various techniques, while a significant investment, will also ensure that we use these tools effectively.

For the software industry specifically, LLMs provide ways to significantly improve software production, application performance, and customer service, in some cases radically so. The question is not if LLMs will give some software companies a competitive advantage in the marketplace, but when and in what areas.

Questions remain

Even with all these exciting developments, language models have several problems. Chief among them is that they often hallucinate outputs, meaning that the outputs can state “facts” that cannot be inferred from the training data. This has massive implications for using these models in any realistic applications. While one school of thought maintains that the only solution is using other methods (e.g., knowledge graphs, symbolic methods) in conjunction with language models, another school of thought believes these problems can be solved within the deep learning paradigm. Only time will be the judge of who is right. Another major problem with these models is their training cost, especially in terms of energy consumption. This problem can be approached at various levels, from more efficient hardware to better, more sample-efficient algorithms. There has also been promising progress in using more carefully curated data⁷ to train much smaller models with performance similar to larger ones.

Aside from these practical questions around accuracy and cost, the very effectiveness of these models—the likelihood that people will actually use them en masse for things—means that we have to begin thinking about the legal and ethical issues relating to their use, particularly looking through an open source lens. Is it legal to use publicly available data for training? When, and to what extent? Should attribution or even compensation be required? Can models be considered copyrightable subject matter? If so, how should they be licensed? How should derivative, fine-tuned models be licensed? We examine some of these questions below.

Does copyright apply?

From a legal and ethical perspective, generative AI, including LLMs, has inspired many debates about copyright, licensing, and even the principles of open source. Right now, we are a long way from clear answers, and it’s worth keeping in mind that a growing number of entities could lobby to influence the state of copyright law, and courts may hand down decisions that change the law in unexpected ways. While we wait for greater clarity, these are some critical questions to keep in mind. Note: most of what follows applies specifically to US law. The complexity of considering the treatment of this topic under a multiplicity of legal jurisdictions is generally beyond the scope of this article.

How will we determine limits on the use of training data?

Individual items of training data will in some cases be copyrighted, and those who assemble training data sets may in some cases have a copyright interest in the data set as a whole. Outside the US, some countries have additionally recognized sui generis database rights or other rights in non-creative data compilations. If training data is under copyright, the copyright owner can set limits or conditions on the freedom of others to make, distribute, or adapt copies. If you are training a model, you are necessarily making copies of the training data. Individual copyrightable data items (which could be, for example, some text, source code, or an image) might be covered by no license, or they may be covered by a license ranging from (a) one permitting essentially all uses with no conditions, to (b) relatively permissive licenses with limited conditions (open source, open content, and open data licenses are subsets of this category), to (c) relatively restrictive licenses (like proprietary software licenses).

This is not the end of the story, however. In the US, the fact-dependent doctrine of fair use allows copying of copyright-protected materials for certain purposes such as education, research, and journalism, while some other countries have begun to legislatively carve out exceptions to copyright law for activities like text and data mining and web scraping. These limits on copyright protection likely benefit activities involved in training models.

Can learning models themselves be copyrighted?

A deep learning model is specified by its architecture and parameters—its weights and biases. While courts have not yet addressed this issue, it seems unlikely that weights and biases can be subject to copyright protection, at least under current US law. In some circumstances, of course, a set of numbers may be copyrightable (for example, any digital encoding of some original, creative, and expressive content). But a model’s parameters are not such an encoding.

Copyright only covers original and creative expression. Ideas, for example, are not copyrightable. In its landmark decision in Feist Publications, Inc., v. Rural Telephone Service Co. (1991), the US Supreme Court held that information by itself—like a collection of phone numbers—is not protectable under copyright.

What claim do creators have to their labor?

The court in Feist concluded that the effort to compile mere information, no matter how laborious, had no impact on copyright protection, explicitly rejecting the earlier “sweat-of-the-brow” doctrine. (There may be other countries where doctrines like “sweat-of-the-brow” are viable.) The court has been clear that copyright exists to promote knowledge and creative expression, not to reward labor or restrict the sharing of facts. This point has some resonance in the 2021 case of Google LLC v. Oracle America, Inc., in which the Supreme Court held that Google’s copying of the Java SE API, which included only those lines of code needed to allow programmers to create a new program, was a fair use of that material. In the trial phases of the case, Oracle placed some emphasis on the amount of effort and care that went into designing a complex API. An appeal to the value of labor may have emotional resonance for engineers who put in hours of work, but it should not properly have a bearing on the question of copyrightability. And, of course, if the weights and biases are not copyrightable, efforts to regulate use of the model parameters through purported copyright licensing should not be effective, though there may be alternative legal machinery for achieving such regulation.

How will open source licensing react to the rise of machine learning?

Open source licenses are primarily (though not entirely) low-friction forms of copyright licensing characterized by normative, customary limits on how restrictive the license conditions can be. One question that the open source and the machine learning practitioner communities are grappling with is whether existing open source licensing norms adequately address the issues introduced by generative AI. Open source licenses facilitate machine learning—above all because of the availability of powerful open source machine learning frameworks like PyTorch—but the open source licenses in use today were all developed before machine learning models became an issue of significant interest.

Several recent developments around AI have had an impact on broader ongoing debates over the proper meaning and scope of open source. Some organizations have been releasing machine learning artifacts, including model checkpoints, in public repositories on GitHub and Hugging Face, under restrictive licenses not compatible with the Open Source Definition (OSD), for example, licenses prohibiting commercial use or non-research use, yet describing such releases as “open source.” The Open Source Initiative, which maintains the OSD, has raised concerns about “open-washing” in the industry.

At the same time, some machine learning practitioners have been promoting so-called Responsible AI Licenses (RAIL), which feature use restrictions aimed at preventing the use of AI for purposes regarded as unethical or at odds with certain social policy goals. These restrictions are particularly centered around the use of model weights, despite their dubious protectability under copyright. The various restrictions in the RAIL licenses prevent them from satisfying the OSD, but some RAIL advocates no doubt believe that the definition of open source itself should be changed or relaxed to accommodate such new regulatory models as applied to community-released model artifacts.

Another development arising out of the tension between open source licensing and machine learning stems from the widely publicized tendency of certain generative models to replicate potentially copyrightable portions of training data, an issue that underlies a number of recent lawsuits brought against companies developing and commercializing generative AI technology. The specific area of source code generative tools came to public attention with the launch of GitHub Copilot. Some open source developers, particularly those using copyleft licenses like the GPL, have not only argued that such replications typically will not comply with the requirements of open source licenses but have also raised broader concerns about the use of their code in training data, even though any open source license should permit such use. These developers may find it appealing to add license prohibitions against use in machine learning, even though, as with RAIL, this would represent a departure from open source licensing norms. Some developers of generative AI programming assistant tools have responded to these concerns in various constructive ways, such as enabling authors to opt out of having their code used in training data and attempting to document the provenance and licensing of generated output.

Amid all this ambiguity, one thing that’s clear is that we should be skeptical about making copyright licensing do more than it should. That’s a point Luis Villa made in the podcast series “Was open source inevitable”: “For years, we said the licenses were the only acceptable way to legislate behavior. . . . Maybe we wouldn’t be having so many of these discussions today if we’d said that codes of conduct are also important and how we behave with each other as peers and friends and human beings.” As the use of machine learning and generative AI expands, there’s a risk that people will make assumptions about what is actually licensable and enter into agreements that are not enforceable. There may also be some activities—disclosing your training data, or at least information about your training data, for example—that become social expectations even if they are not legal requirements mandated by a license. The practice of publishing “model cards” and similar information seems to point in this direction.

Realizing the potential of Generative AI and LLMs described in the first half of this article will depend on open source communities, industry, and AI/ML researchers working together in the open. The more roadblocks we set up, the slower the progress.

Footnotes

1. AlexNet. (2023, July 21). In Wikipedia. https://en.wikipedia.org/wiki/AlexNet

2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. https://doi.org/10.1145/3065386

3. What is gradient descent? IBM.com. https://www.ibm.com/topics/gradient-descent

4. Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. ArXiv. https://doi.org/10.48550/arXiv.2001.08361

5. Hu, E., Wallis, P., Allen-Zhu, Z., et al. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of the International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2106.09685

6. Bosma, M., & Wei, J. (2021, June 10). Introducing FLAN: More generalizable language models with instruction fine-tuning. Google Research. https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html

7. Gunasekar, S., Zhang, Y., Aneja, J., & Mendes, C. (2023). Textbooks are all you need. ArXiv. https://doi.org/10.48550/arXiv.2306.11644
