Student research spotlight—Alexandra Skysľaková
Not all large language models (LLMs) are equally good at generating tests for all programming languages. Alexandra Skysľaková, a recent graduate from the Faculty of Informatics at Masaryk University (MUNI), focused her Master’s thesis on comparing popular LLMs and their effectiveness with several common languages. Alexandra received the Dean’s Award for Outstanding Final Thesis for her work, which provides valuable insights into using LLMs for better software development.
Alexandra’s Red Hat thesis supervisor was Marek Grác, a research fellow with the Red Hat Research team who specializes in AI fields including machine learning and natural language processing.
Why did you choose to focus on LLMs for test generation as your thesis topic?
LLMs are powerful tools with the potential to significantly change the IT industry. They can dramatically simplify areas of development that are repetitive and time-consuming, such as testing. Unit testing is important for maintaining high-quality code, but creating comprehensive tests can be tedious, and LLMs seem like a perfect tool that could help increase developer productivity.
Tell us about your thesis and its findings.
My thesis explores how LLMs can generate high-quality unit tests for software code. It compares three large language models—OpenAI’s GPT-4o, Google’s Gemini-1.5-pro-002, and DeepSeek-Coder-V2.5—assessing their strengths and weaknesses in producing tests for Python, Java, Kotlin, and Go. Carefully selected metrics, such as syntactic correctness, executability, code coverage, and test quality, are used to determine which models excel under specific conditions.
The findings reveal that all models perform best in Python, but their effectiveness varies widely across other languages. For instance, GPT excelled in Kotlin, Gemini stood out in Java, and DeepSeek provided surprisingly good results in Go. Despite these differences, the study found no clear relationship between test quality and code complexity or length, suggesting that the models’ reasoning processes may not depend on these factors.
This thesis helps practitioners make informed decisions about choosing and applying AI-driven tools for automated test generation by pinpointing where each model thrives. It also highlights areas for future research, such as refining model behaviour across different programming languages and improving test quality standards. In doing so, this work offers valuable insights into how next-generation AI models might further enhance and streamline automated software testing, ultimately making software more reliable. (Learn more and see the full text of Alexandra’s thesis on the MUNI website.)
Why did you decide to test these specific LLMs?
GPT is one of the most popular LLMs today, but it is also pricey; I definitely wanted to include it in this study. Then I selected Gemini’s newest Pro model, as it is considered one of GPT’s main competitors but is a little more affordable. As a third model, I chose DeepSeek because it is open source. It produces results of comparable quality to top-tier models and is available for a small fraction of the price of the others.
Why did you decide to focus on these programming languages?
Each language poses a unique challenge for LLM-based test generation. Python’s simplicity and dynamic typing test a model’s ability to handle ambiguity, while Java’s verbose syntax and object-oriented style require attention to well-established testing patterns. Kotlin adds modern features like null safety and Java interoperability, while Go emphasizes concurrency and explicit error handling and may be less well represented in training datasets, since it is less common than Python or Java. Evaluating these different languages shows how effectively LLMs can generate tests across a range of coding styles.
What were the key criteria you used to evaluate the quality of the generated test cases?
The thesis evaluates test quality using several key criteria combined into one final score. It first checks for syntax errors and runtime failures, ensuring that the generated tests compile and run correctly. Then test coverage is measured to determine how much of the code under test is exercised. Execution time is also tracked, as faster tests are more efficient in continuous integration pipelines. After that, common code smells are detected with tooling typical for the given language, to flag structural or maintainability issues in the test code. Finally, two assertion-based metrics are used: the Assertions-McCabe ratio (the number of assertions relative to the cyclomatic complexity of the tested code) and assertion density (the number of assertions per line of test code). Together, these measures provide a comprehensive view of test correctness, efficiency, and maintainability.
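To make the two assertion-based metrics concrete, here is a minimal Python sketch of how they could be computed for a generated test file. This is not the pipeline used in the thesis: the helper names are hypothetical, assertions are counted naively from the syntax tree, and the McCabe complexity of the code under test is assumed to come from a separate, language-specific tool and is simply passed in as a number.

```python
import ast


def count_assertions(test_source: str) -> int:
    """Count plain `assert` statements and unittest-style self.assert* calls."""
    tree = ast.parse(test_source)
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert):
            count += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                count += 1
    return count


def assertion_density(test_source: str) -> float:
    """Assertions per non-blank line of test code."""
    lines = [ln for ln in test_source.splitlines() if ln.strip()]
    return count_assertions(test_source) / max(len(lines), 1)


def assertions_mccabe_ratio(test_source: str, mccabe_complexity: int) -> float:
    """Assertions divided by the cyclomatic (McCabe) complexity of the tested code.

    The complexity value would come from a language-specific analyzer;
    it is passed in here to keep the sketch self-contained.
    """
    return count_assertions(test_source) / max(mccabe_complexity, 1)


if __name__ == "__main__":
    example_test = """
import unittest

class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)
        assert 2 + 2 == 4
"""
    print(f"assertions:        {count_assertions(example_test)}")
    print(f"assertion density: {assertion_density(example_test):.2f}")
    print(f"Assertions-McCabe: {assertions_mccabe_ratio(example_test, mccabe_complexity=3):.2f}")
```

An evaluation covering Python, Java, Kotlin, and Go would need analogous, per-language measurements, since each language has its own assertion idioms and complexity tooling.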
What impact can the results of your findings have on software development?
The findings of my thesis show some weaknesses in each of the models, which can be taken into account when selecting the best model for generating tests for a specific use case. For example, when testing Java code, Gemini models might be a better choice than GPT. LLM vendors tend to advertise mainly their models’ strengths, and while the results may be great in Python, the generated code in less benchmarked languages may be significantly worse than users would expect. The results can also serve as a base for further research on the topic. Having a clear idea of what goes well and where weaknesses appear when generating tests can help not only end users but also the companies developing LLMs, who gain more insight into what could be improved in their models.
What was the biggest challenge when working on your thesis?
I would say the biggest challenge was selecting criteria for evaluating the quality of the generated tests. It is very hard to say what makes a test good, and even harder to make that judgment automatic and deterministic so it can be evaluated through code. The thesis evaluates a small set of aspects, and there are many other ways to evaluate the tests. So the next challenge was to recognize that the scope of the thesis was already big and complex enough and not to add more. I am a perfectionist, so it was hard to tell myself it was enough. Luckily, there is already a student who will follow up on this topic.
Your thesis was awarded for its contributions—what do you believe made it stand out?
I think it stood out because of its complexity and the amount of work required to finish it. There was no ideal dataset to start with, so I had to tweak an existing one to fit my needs. I built a meaningful evaluation pipeline and evaluated the code in multiple languages with multiple models. When I was writing the thesis, there was little existing research on the topic, which made my work harder. It resulted in a study that had no direct precedent and can be considered a base for further research.
How did writing this thesis and working with your supervisor help you develop your skills and knowledge?
My supervisor, Marek Grác, significantly helped me in the beginning, discussing possible ways to understand this topic and what to focus on. I had free rein in choosing the tools and languages, as well as in determining how the thesis would be conducted and implemented. Marek was always ready to answer my questions and guide me whenever I needed support.
From working on my thesis, I gained better insight into how different LLMs work and their strengths and weaknesses, but I also learned more about tests and how they can be evaluated. Diving into research is not something I do on a daily basis, so it was an interesting change that resulted in a lot of ideas about how LLMs could be applied in the industry.