Student research spotlight—Alexandra Skysľaková
Not all large language models (LLMs) are equally good at generating tests for all programming languages. Alexandra Skysľaková, a recent graduate from the Faculty of Informatics at Masaryk University (MUNI), focused her Master’s thesis on comparing popular LLMs and their effectiveness with several common languages. Alexandra received the Dean’s Award for Outstanding Final Thesis for her work, which provides valuable insights into using LLMs for better software development.
Alexandra’s Red Hat thesis supervisor was Marek Grác, a research fellow with the Red Hat Research team who specializes in AI fields including machine learning and natural language processing.
Why did you choose to focus on LLMs for test generation as your thesis topic?
LLMs are powerful tools with the potential to significantly change the IT industry. They can dramatically simplify areas of development that are repetitive and time-consuming, such as testing. Unit testing is important for maintaining high-quality code, but creating comprehensive tests can be tedious, and LLMs seem like a perfect tool that could help increase developer productivity.
Tell us about your thesis and its findings.
My thesis explores how LLMs can generate high-quality unit tests for software code. It compares three large language models—OpenAI’s GPT-4o, Google’s Gemini-1.5-pro-002, and DeepSeek-Coder-V2.5—assessing their strengths and weaknesses in producing tests for Python, Java, Kotlin, and Go. Carefully selected metrics, such as syntactic correctness, executability, code coverage, and test quality, are used to determine which models excel under specific conditions.
The findings reveal that all models perform best in Python, but their effectiveness varies widely across other languages. For instance, GPT excelled in Kotlin, Gemini stood out in Java, and DeepSeek provided surprisingly good results in Go. Despite these differences, the study found no clear relationship between test quality and code complexity or length, suggesting that the models’ reasoning processes may not depend on these factors.
This thesis helps practitioners make informed decisions about choosing and applying AI-driven tools for automated test generation by pinpointing where each model thrives. It also highlights areas for future research, such as refining model behaviour across different programming languages and improving test quality standards. In doing so, this work offers valuable insights into how next-generation AI models might further enhance and streamline automated software testing, ultimately making software more reliable. (Learn more and see the full text of Alexandra’s thesis on the MUNI website.)
Why did you decide to test these specific LLMs?
GPT is one of the most popular LLMs today, but it is also pricey; I definitely wanted to include it in this study. Then I selected Gemini’s newest Pro model, as it is considered one of GPT’s main competitors but is a little more affordable. As a third model, I chose DeepSeek because it is open source. It produces results of comparable quality to top-tier models and is available for a small fraction of the price of the others.
Why did you decide to focus on these programming languages?
Each language poses a unique challenge for LLM-based test generation. Python’s simplicity and dynamic typing test a model’s ability to handle ambiguity, while Java’s verbose syntax and object-oriented style require attention to well-established testing patterns. Kotlin adds modern features like null safety and Java interoperability, while Go emphasizes concurrency and explicit error handling and may be less well represented in training datasets, since it is less common than Python or Java. Evaluating these different languages shows how effectively LLMs can generate tests across a range of coding styles.
What were the key criteria you used to evaluate the quality of the generated test cases?
The thesis evaluates test quality using several key criteria combined into one final score. It first checks for syntax errors and runtime failures, ensuring that the generated tests compile and run correctly. Then test coverage is measured to determine how much of the code under test is exercised. Execution time is also tracked, as faster tests are more efficient in continuous integration pipelines. After that, common code smells are detected with tooling typical for the given language, to flag structural or maintainability issues in the test code. Finally, two assertion-based metrics are used: the Assertions-McCabe ratio (the number of assertions relative to the cyclomatic complexity of the tested code) and assertion density (the number of assertions per line of test code). Together, these measures provide a comprehensive view of test correctness, efficiency, and maintainability.
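To make the two assertion-based metrics concrete, here is a minimal Python sketch of how they could be computed for a generated test file. This is not the pipeline used in the thesis: the helper names are hypothetical, assertions are counted naively from the syntax tree, and the McCabe complexity of the code under test is assumed to come from a separate, language-specific tool and is simply passed in as a number.

```python
import ast


def count_assertions(test_source: str) -> int:
    """Count plain `assert` statements and unittest-style self.assert* calls."""
    tree = ast.parse(test_source)
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert):
            count += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                count += 1
    return count


def assertion_density(test_source: str) -> float:
    """Assertions per non-blank line of test code."""
    lines = [ln for ln in test_source.splitlines() if ln.strip()]
    return count_assertions(test_source) / max(len(lines), 1)


def assertions_mccabe_ratio(test_source: str, mccabe_complexity: int) -> float:
    """Assertions divided by the cyclomatic (McCabe) complexity of the tested code.

    The complexity value would come from a language-specific analyzer;
    it is passed in here to keep the sketch self-contained.
    """
    return count_assertions(test_source) / max(mccabe_complexity, 1)


if __name__ == "__main__":
    example_test = """
import unittest

class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)
        assert 2 + 2 == 4
"""
    print(f"assertions:        {count_assertions(example_test)}")
    print(f"assertion density: {assertion_density(example_test):.2f}")
    print(f"Assertions-McCabe: {assertions_mccabe_ratio(example_test, mccabe_complexity=3):.2f}")
```

An evaluation covering Python, Java, Kotlin, and Go would need analogous, per-language measurements, since each language has its own assertion idioms and complexity tooling.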
What impact can the results of your findings have on software development?
The findings of my thesis show some weaknesses in each of the models, which can be taken into account when selecting the best model for generating tests for a specific use case. For example, when testing Java code, Gemini models might be a better choice than GPT. LLM vendors tend to advertise mainly their models’ strengths, and while the results may be great in Python, the generated code in less benchmarked languages may be significantly worse than users would expect. The results can also serve as a base for further research on the topic. Having a clear idea of what goes well and where weaknesses appear when generating tests can help not only end users but also the companies developing LLMs, who gain more insight into what could be improved in their models.
What was the biggest challenge when working on your thesis?
I would say the biggest challenge was selecting criteria for evaluating the quality of the generated tests. It is very hard to say what makes a test good, and even harder to make that judgment automatic and deterministic so it can be evaluated through code. The thesis evaluates a small set of aspects, and there are many other ways to evaluate the tests. So the next challenge was to recognize that the scope of the thesis was already big and complex enough and not to add more. I am a perfectionist, so it was hard to tell myself it was enough. Luckily, there is already a student who will follow up on this topic.
Your thesis was awarded for its contributions—what do you believe made it stand out?
I think it stood out because of its complexity and the amount of work required to finish it. There was no ideal dataset to start with, so I had to tweak an existing one to fit my needs. I built a meaningful evaluation pipeline and evaluated the code in multiple languages with multiple models. When I was writing the thesis, there was little existing research on the topic, which made my work harder. It resulted in a study that had no direct precedent and can be considered a base for further research.
How did writing this thesis and working with your supervisor help you develop your skills and knowledge?
My supervisor, Marek Grác, significantly helped me in the beginning, discussing possible ways to understand this topic and what to focus on. I had free rein in choosing the tools and languages, as well as in determining how the thesis would be conducted and implemented. Marek was always ready to answer my questions and guide me whenever I needed support.
From working on my thesis, I gained better insight into how different LLMs work and their strengths and weaknesses, but I also learned more about tests and how they can be evaluated. Diving into research is not something I do on a daily basis, so it was an interesting change that resulted in a lot of ideas about how LLMs could be applied in the industry.