Student research spotlight: Dominik Tuchyňa
Many developers rely on LLMs to generate unit tests. Dominik Tuchyňa, a recent Master’s student at the Faculty of Informatics at Masaryk University (MUNI), focused his thesis research on developing a tool that could help benchmark these models and extend the scope of explainability into the world of machine-generated code. Tools that improve the quality of benchmark tests can, in turn, help improve the LLMs behind AI-assisted development tools, supporting product excellence, developer experience, and long-term scalability.
Dominik’s Red Hat thesis supervisor was Marek Grác, an engineer with the Red Hat Research team who specializes in AI fields including machine learning and natural language processing. Before starting his Master’s degree, Dominik was an intern in AI Data Engineering in the Red Hat Office of the CTO, and he is currently an Assistant Vice President Python Developer at Barclays in Prague.
Why did you choose to focus on LLM-generated unit tests as your thesis topic?
When first thinking about a thesis topic, I was interested in source-code-based text prediction. These were pre-Covid times, and tools such as Copilot did not exist yet, commercially or even academically. My first idea was to try code embeddings and extend them to code-change embeddings by mining git commits from open source repositories. Unfortunately, these concepts were too computationally demanding and the topic itself too complicated for me to grasp at the time, with no clear end goal defined. Since then, my thesis topic changed quite a few times, and I paused my studies. When I got back to school after two years of professional experience, I was able to see the scope of the thesis pragmatically. I made a deal with my supervisor not to do anything complicated but still keep it in the realm of AI source-code operations.
Tell us about your thesis and the tools you developed.
My thesis is titled “Analysis and benchmarking of LLM generated unit tests.” It extends a previous thesis by Alexandra Skysľaková, which benchmarked the code-testing capabilities of LLM models using classic test adequacy metrics such as line coverage or assertion density. These metrics on their own are not sufficient. My thesis examines how mutation analysis can be applied to machine-generated code, with a novel approach that uses a test differentiator to include test failures with unknown oracles. In addition to evaluating the outputs and performance of LLMs using mutation metrics, the work introduces an extended version of the MutPy library featuring a test differentiator based on cosine similarity, as well as two new metrics for assessing fault-detection effectiveness: a granular mutation score at the level of individual tests and the RAPFD (Relative Average Percentage of Faults Detected) metric for evaluating test prioritization.
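To make the test differentiator idea concrete, here is a minimal sketch, not the thesis’s MutPy extension itself: assume a test already fails on the original program, so its oracle is unknown; the differentiator compares the textual failure output on the original with the failure output on a mutant and counts the mutant as killed only if the two failures diverge enough. The bag-of-words vectorization and the threshold value are illustrative placeholders.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two failure outputs, using simple bag-of-words counts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mutant_killed(failure_on_original: str, failure_on_mutant: str, threshold: float = 0.9) -> bool:
    """Differentiate two failures of a test whose expected outcome (oracle) is unknown.

    If the failure produced on the mutant is sufficiently different from the
    failure already seen on the original program, count the mutant as killed.
    """
    return cosine_similarity(failure_on_original, failure_on_mutant) < threshold
```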
To limit the overall scope, I used the same three tested LLMs with the same generated data for mutation-testing-based benchmarking. It would be great to test newer versions of these models along with other openly available open source alternatives, but that would increase the thesis scope. You can always try going broader, but if you do not take care and set limitations, you can fail in your overall goal due to time constraints and mental exhaustion. That’s also the reason why I selected Python as the only language in the tested subset of generated data. Among the other previously benchmarked languages (Java, Go, and Kotlin), Python was probably the easiest for me to work with, as I am a Python developer. I would also argue that the simplicity of developing in Python makes it the most foolproof. That’s not to say I did not have problems during the development and testing phase—I definitely did.
What were the key criteria you used to evaluate the quality of the generated test cases?
The implemented code tasks are taken from the Rosetta Code dataset, and for each implemented task (implementation file), the LLM generates a file containing a test suite of several unit test cases. The way mutation analysis works is that you take a program, modify it with some small deviation (i.e., mutate it), and then run the benchmarked test against it. If the test catches this artificial bug, that is generally a good sign.
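As a toy illustration of that mechanic (hypothetical code, not taken from the benchmark), consider a single arithmetic-operator mutation and a generated test that kills it:

```python
def add(a, b):
    """Original implementation under test."""
    return a + b

def add_mutant(a, b):
    """Mutant: the '+' operator has been replaced with '-'."""
    return a - b

def test_add():
    assert add(2, 3) == 5  # passes on the original program

# Running the same assertion against the mutant would fail
# (add_mutant(2, 3) == -1), so this mutant is "killed" by the test.
```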
First, I took into account how many tests were suitable for mutation analysis per benchmarked model. If generated code cannot be run in the benchmarking pipeline, the model should be penalized. Then there are the mutation testing metrics themselves (the actual results of the mutation analysis for each test suite), such as the mutation score or the implemented RAPFD, a metric used to evaluate the prioritization of unit test cases within a test suite.
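For reference, here is a minimal sketch of those two kinds of metrics, under my own assumption about the data layout (a per-mutant list recording which tests in the suite kill it). The mutation score is standard; the APFD-style function below is the classic prioritization formula that RAPFD builds on, not the exact RAPFD variant from the thesis.

```python
def mutation_score(killed: int, total_mutants: int) -> float:
    """Fraction of generated mutants that the test suite kills."""
    return killed / total_mutants if total_mutants else 0.0

def apfd(kill_matrix: list[list[bool]]) -> float:
    """Classic APFD: rewards suites whose early tests kill mutants sooner.

    kill_matrix[i][j] is True if the j-th test (in suite order) kills mutant i.
    Mutants that no test kills are ignored here, which is one way relative
    (RAPFD-style) variants restrict the measure to what the suite can detect.
    """
    n_tests = len(kill_matrix[0])
    detected = [row for row in kill_matrix if any(row)]
    if not detected:
        return 0.0
    first_kill = [row.index(True) + 1 for row in detected]  # 1-based positions
    return 1 - sum(first_kill) / (n_tests * len(detected)) + 1 / (2 * n_tests)

# Example: two tests, three mutants; the first test kills mutants 0 and 1,
# the second test kills mutants 1 and 2.
matrix = [[True, False], [True, True], [False, True]]
print(mutation_score(killed=3, total_mutants=3))  # 1.0
print(apfd(matrix))                               # ~0.58
```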
Did your project achieve its goals?
It did, and I extended it beyond my original goals by implementing additional metrics (such as RAPFD) and theoretical concepts (test differentiators using cosine similarity), which was definitely not planned at the beginning of thesis development.
What was the biggest challenge while writing your thesis?
Definitely working with external libraries. Although I love open source and the community surrounding it, this expression still holds: with great freedom comes great responsibility. By the way, this was a motto for me in my Red Hat intern days—I even had stickers of it all over my laptop.
Another challenge was ensuring that the implementation was correct and produced the expected outcomes. It is important for a student to write their own unit test cases for the implemented research concepts, so that the results are trustworthy not only for the reader but also for the author.
Red Hatter Marek Grác was your Master’s thesis supervisor. How has he helped you?
I like Marek’s honest attitude and practical approach to solving problems. It’s quite different from some of the academic nonsense a student can sometimes experience with pseudo-intellectuals striving to downplay student abilities for the sake of their own ego. The most important thing was to recognize my previous failed attempts and define a clear, practical goal.
You mentioned that you were also an intern at Red Hat in AI Data Engineering in the Office of CTO. Could you tell us a little bit about this experience?
I was interested in working with Red Hat because it is a name recognized globally in the software development domain, with products recognized worldwide. Plus, at the time I was living a few tram stops away! I worked on Thoth-station software stack resolution and source-operation metrics for AIOps-related open source work.
My experience at Red Hat was absolutely invaluable. (Note to non-English speakers: that means it was so great that it cannot be valued!) I came into contact with the best software development practices and really chill people, so much so that it taught me how to work in my future endeavors as well (shout out to my ex-manager Christoph Görn). And since it was the Office of the CTO, I got to use the newest, cutting-edge technology software stack.
What advice would you give to a new Red Hat intern?
Set a clear goal, be honest about your capabilities, and if there are blockers, talk about it openly. With respect to working on a software project, always bear in mind my favorite Unix philosophy: Clarity is better than cleverness. Also, do not take it too seriously, and remember to choose life, rather than work for life. Go for a walk, a concert, write poetry, watch a movie, read an interview with that cool underground hardcore band whose lyrics you have been trying to decipher. Don’t be soulless—ignite a spark, so your work can be on fire!