Student research spotlight: Dominik Tuchyňa
Many developers rely on LLMs to generate unit tests. Dominik Tuchyňa, a recent Master’s student at the Faculty of Informatics at Masaryk University (MUNI), focused his thesis research on developing a tool that could help benchmark these models and extend the scope of explainability into the world of machine-generated code. Tools that improve the quality of benchmark tests can, in turn, help improve the LLMs behind AI-assisted development tools, supporting product excellence, developer experience, and long-term scalability.
Dominik’s Red Hat thesis supervisor was Marek Grác, an engineer with the Red Hat Research team who specializes in AI fields including machine learning and natural language processing. Before starting his Master’s degree, Dominik was an intern in AI Data Engineering in the Red Hat Office of the CTO, and he is currently an Assistant Vice President Python Developer at Barclays in Prague.
Why did you choose to focus on LLM-generated unit tests as your thesis topic?
When first thinking about a thesis topic, I was interested in source-code-based text prediction. These were pre-Covid times, and tools such as Copilot did not exist yet, commercially or even academically. My first idea was to try code embeddings and extend them to code-change embeddings by mining git commits from open source repositories. Unfortunately, these concepts were too computationally demanding and the topic itself too complicated for me to grasp at the time, with no clear end goal defined. Since then, my thesis topic changed quite a few times, and I paused my studies. When I got back to school after two years of professional experience, I was able to see the scope of the thesis pragmatically. I made a deal with my supervisor not to do anything complicated but still keep it in the realm of AI source-code operations.
Tell us about your thesis and the tools you developed.
My thesis is titled “Analysis and benchmarking of LLM generated unit tests.” It extends a previous thesis by Alexandra Skysľaková, which benchmarked the code-testing capabilities of LLM models using classic test adequacy metrics such as line coverage or assertion density. These metrics on their own are not sufficient. My thesis examines how mutation analysis can be applied to machine-generated code, with a novel approach that uses a test differentiator to include test failures with unknown oracles. In addition to evaluating the outputs and performance of LLMs using mutation metrics, the work introduces an extended version of the MutPy library featuring a test differentiator based on cosine similarity, as well as two new metrics for assessing fault-detection effectiveness: a granular mutation score at the level of individual tests and the RAPFD (Relative Average Percentage of Faults Detected) metric for evaluating test prioritization.
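To make the test differentiator idea concrete, here is a minimal sketch, not the thesis’s MutPy extension itself: assume a test already fails on the original program, so its oracle is unknown; the differentiator compares the textual failure output on the original with the failure output on a mutant and counts the mutant as killed only if the two failures diverge enough. The bag-of-words vectorization and the threshold value are illustrative placeholders.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two failure outputs, using simple bag-of-words counts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mutant_killed(failure_on_original: str, failure_on_mutant: str, threshold: float = 0.9) -> bool:
    """Differentiate two failures of a test whose expected outcome (oracle) is unknown.

    If the failure produced on the mutant is sufficiently different from the
    failure already seen on the original program, count the mutant as killed.
    """
    return cosine_similarity(failure_on_original, failure_on_mutant) < threshold
```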
To limit the overall scope, I used the same three tested LLMs with the same generated data for mutation-testing-based benchmarking. It would be great to test newer versions of these models along with other openly available open source alternatives, but that would increase the thesis scope. You can always try going broader, but if you do not take care and set limitations, you can fail in your overall goal due to time constraints and mental exhaustion. That’s also the reason why I selected Python as the only language in the tested subset of generated data. Among the other previously benchmarked languages (Java, Go, and Kotlin), Python was probably the easiest for me to work with, as I am a Python developer. I would also argue that the simplicity of developing in Python makes it the most foolproof. That’s not to say I did not have problems during the development and testing phase—I definitely did.
What were the key criteria you used to evaluate the quality of the generated test cases?
The implemented code tasks are taken from the Rosetta Code dataset, and for each implemented task (implementation file), the LLM generates a file containing a test suite of several unit test cases. The way mutation analysis works is that you take a program, modify it with some small deviation (i.e., mutate it), and then run the benchmarked test against it. If the test catches this artificial bug, that is generally a good sign.
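As a toy illustration of that mechanic (hypothetical code, not taken from the benchmark), consider a single arithmetic-operator mutation and a generated test that kills it:

```python
def add(a, b):
    """Original implementation under test."""
    return a + b

def add_mutant(a, b):
    """Mutant: the '+' operator has been replaced with '-'."""
    return a - b

def test_add():
    assert add(2, 3) == 5  # passes on the original program

# Running the same assertion against the mutant would fail
# (add_mutant(2, 3) == -1), so this mutant is "killed" by the test.
```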
First, I took into account how many tests were suitable for mutation analysis per benchmarked model. If generated code cannot be run in the benchmarking pipeline, the model should be penalized. Then there are the mutation testing metrics themselves (the actual results of the mutation analysis for each test suite), such as the mutation score or the implemented RAPFD, a metric used to evaluate the prioritization of unit test cases within a test suite.
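For reference, here is a minimal sketch of those two kinds of metrics, under my own assumption about the data layout (a per-mutant list recording which tests in the suite kill it). The mutation score is standard; the APFD-style function below is the classic prioritization formula that RAPFD builds on, not the exact RAPFD variant from the thesis.

```python
def mutation_score(killed: int, total_mutants: int) -> float:
    """Fraction of generated mutants that the test suite kills."""
    return killed / total_mutants if total_mutants else 0.0

def apfd(kill_matrix: list[list[bool]]) -> float:
    """Classic APFD: rewards suites whose early tests kill mutants sooner.

    kill_matrix[i][j] is True if the j-th test (in suite order) kills mutant i.
    Mutants that no test kills are ignored here, which is one way relative
    (RAPFD-style) variants restrict the measure to what the suite can detect.
    """
    n_tests = len(kill_matrix[0])
    detected = [row for row in kill_matrix if any(row)]
    if not detected:
        return 0.0
    first_kill = [row.index(True) + 1 for row in detected]  # 1-based positions
    return 1 - sum(first_kill) / (n_tests * len(detected)) + 1 / (2 * n_tests)

# Example: two tests, three mutants; the first test kills mutants 0 and 1,
# the second test kills mutants 1 and 2.
matrix = [[True, False], [True, True], [False, True]]
print(mutation_score(killed=3, total_mutants=3))  # 1.0
print(apfd(matrix))                               # ~0.58
```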
Did your project achieve its goals?
It did, and I extended it beyond my original goals by implementing additional metrics (such as RAPFD) and theoretical concepts (test differentiators using cosine similarity), which was definitely not planned at the beginning of thesis development.
What was the biggest challenge while writing your thesis?
Definitely working with external libraries. Although I love open source and the community surrounding it, this expression still holds: with great freedom comes great responsibility. By the way, this was a motto for me in my Red Hat intern days—I even had stickers of it all over my laptop.
Another challenge was ensuring that the implementation was correct and produced the expected outcomes. It is important for a student to write their own unit test cases for the implemented research concepts, so that the results are trustworthy not only for the reader but also for the author.
Red Hatter Marek Grác was your Master’s thesis supervisor. How has he helped you?
I like Marek’s honest attitude and practical approach to solving problems. It’s quite different from some of the academic nonsense a student can sometimes experience with pseudo-intellectuals striving to downplay student abilities for the sake of their own ego. The most important thing was to recognize my previous failed attempts and define a clear, practical goal.
You mentioned that you were also an intern at Red Hat in AI Data Engineering in the Office of CTO. Could you tell us a little bit about this experience?
I was interested in working with Red Hat because it is a name recognized globally in the software development domain, with products recognized worldwide. Plus, at the time I was living a few tram stops away! I worked on Thoth-station software stack resolution and source-operation metrics for AIOps-related open source work.
My experience at Red Hat was absolutely invaluable. (Note to non-English speakers: that means it was so great that it cannot be valued!) I came into contact with the best software development practices and really chill people, so much so that it taught me how to work in my future endeavors as well (shout out to my ex-manager Christoph Görn). And since it was the Office of the CTO, I got to use the newest, cutting-edge technology software stack.
What advice would you give to a new Red Hat intern?
Set a clear goal, be honest about your capabilities, and if there are blockers, talk about it openly. With respect to working on a software project, always bear in mind my favorite Unix philosophy: Clarity is better than cleverness. Also, do not take it too seriously, and remember to choose life, rather than work for life. Go for a walk, a concert, write poetry, watch a movie, read an interview with that cool underground hardcore band whose lyrics you have been trying to decipher. Don’t be soulless—ignite a spark, so your work can be on fire!