TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited applicability of current large language models (LLMs) to unit testing: existing approaches typically focus on isolated test generation or assertion prediction and struggle to support holistic test suite maintenance. We propose TAM-Eval, a novel framework and benchmark that, for the first time, constructs a multilingual dataset of 1,539 real-world test maintenance scenarios at the test-file level across Python, Java, and Go, contextualized within actual software repositories. TAM-Eval introduces a reference-free evaluation protocol that uniformly assesses both base models and agent-based workflows. Comprehensive evaluation using test pass rate, code coverage, and mutation score reveals that state-of-the-art models exhibit significant limitations on realistic test creation, repair, and updating tasks, highlighting substantial challenges and research opportunities in this domain.

📝 Abstract
While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
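The reference-free protocol described above scores a maintained test suite purely by executing it, without comparing against reference tests. A minimal sketch of how such per-suite metrics can be computed (the `SuiteResult` schema and helper names here are illustrative, not TAM-Eval's actual API):

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Execution outcome of one maintained test suite (hypothetical schema)."""
    tests_passed: int    # tests that passed when the suite was run
    tests_total: int     # all tests collected from the suite
    mutants_killed: int  # injected mutants detected (killed) by the suite
    mutants_total: int   # all mutants injected into the code under test


def pass_rate(r: SuiteResult) -> float:
    """Fraction of tests in the suite that pass; 0.0 for an empty suite."""
    return r.tests_passed / r.tests_total if r.tests_total else 0.0


def mutation_score(r: SuiteResult) -> float:
    """Fraction of injected mutants the suite kills; 0.0 if none were injected."""
    return r.mutants_killed / r.mutants_total if r.mutants_total else 0.0


result = SuiteResult(tests_passed=18, tests_total=20,
                     mutants_killed=30, mutants_total=50)
print(pass_rate(result))       # 0.9
print(mutation_score(result))  # 0.6
```

Because both metrics are derived only from running the candidate suite against the repository (and its mutants), no ground-truth reference tests are needed, which is what makes the protocol system-agnostic across raw LLMs and agentic workflows.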
Problem

Research questions and friction points this paper is trying to address.

unit test maintenance
Large Language Models
test suite
software engineering
automated testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

test maintenance
large language models
automated testing
benchmark
mutation testing
Elena Bruches
Senior Lecturer, Novosibirsk State University
natural language processing
Vadim Alperovich
T-Technologies, Moscow, Russia
Dari Baturova
Siberian Neuronets LLC, Novosibirsk, Russia
Roman Derunets
Siberian Neuronets LLC, Novosibirsk, Russia
D. Grebenkin
Siberian Neuronets LLC, Novosibirsk, Russia
Georgy Mkrtchyan
T-Technologies, Moscow, Russia
Oleg Sedukhin
Siberian Neuronets LLC, Novosibirsk, Russia
Mikhail Klementev
Siberian Neuronets LLC, Novosibirsk, Russia
Ivan Bondarenko
Researcher, Laboratory of Applied Digital Technologies, Novosibirsk State University
Deep Learning, Natural Language Processing, Automatic Speech Recognition, Automated Machine Learning, Few-Shot Learning
Nikolay Bushkov
T-Technologies, Moscow, Russia
Stanislav Moiseev
T-Technologies
computer science, AI, mathematics