AI Summary
Existing automated test update approaches for continuous integration struggle to ensure assertion adequacy while also precisely targeting uncovered code paths, and they are susceptible to hallucinations from large language models (LLMs). This work proposes MuMuTestUp, the first test update framework to integrate mutation analysis with a multi-agent architecture. Its three specialized agents collaborate to strengthen assertions, generate targeted repair instructions, and apply semantic retrieval to mitigate LLM hallucinations. By focusing on specific uncovered lines and branches rather than coarse-grained coverage metrics, MuMuTestUp aims to improve the quality of updated tests. Experimental evaluation on the newly curated PRBENCH dataset demonstrates that MuMuTestUp significantly outperforms state-of-the-art baselines in assertion strength, coverage precision, and execution stability.
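To make the assertion-strengthening idea concrete, here is a minimal sketch of what a surviving mutant reveals about a weak assertion. It is not MuMuTestUp's implementation: the functions `order_total`, `order_total_mutant`, `weak_test`, `strong_test`, and `survives` are hypothetical names invented for illustration, and Python is used only for brevity even though the PRBENCH projects are Java.

```python
def order_total(price, qty):
    # Code under test (hypothetical): 10% discount for 10 or more items.
    return price * qty * 0.9 if qty >= 10 else price * qty


def order_total_mutant(price, qty):
    # Mutant: the boundary operator >= has been replaced with >.
    return price * qty * 0.9 if qty > 10 else price * qty


def weak_test(fn):
    # Executes the discount line but only checks that the result is positive.
    return fn(2.0, 10) > 0


def strong_test(fn):
    # Checks the exact value at the boundary, so a boundary mutant is exposed.
    return abs(fn(2.0, 10) - 18.0) < 1e-9


def survives(test, mutant_fn):
    # A mutant survives when the test still passes against the mutated code.
    return test(mutant_fn)


if __name__ == "__main__":
    print("mutant survives weak test:", survives(weak_test, order_total_mutant))
    print("mutant survives strong test:", survives(strong_test, order_total_mutant))
```

Both tests cover the same line, so line coverage alone cannot tell them apart; only the mutant's survival signals that the first assertion is too weak.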
Abstract
Modern software systems evolve rapidly under CI/CD practices, where tests are critical for quality. However, substantial code changes often render existing test cases obsolete, causing pipeline disruptions, reduced productivity, and compromised quality. Recent automatic test update approaches use LLMs to refine test cases via execution feedback and exact-matching context retrieval. They prioritize executability and line coverage but suffer from three limitations: (1) they neglect test assertion adequacy, which weakens fault detection; (2) they rely on a coarse line coverage figure instead of the specific uncovered lines and branches; (3) they use exact-matching retrieval, which fails when the LLM issues a hallucinated query. To address these limitations, we propose MuMuTestUp, a mutation-guided multi-agent framework with three specialized agents: Mutation Analysis (strengthens assertions via surviving mutants), Coverage Analysis (generates targeted repair instructions for uncovered lines and branches), and Semantic Retrieval (handles hallucinated queries via semantic-similarity search). We also construct PRBENCH, a pull-request-level dataset of 571 samples from 10 open-source Java projects, validated for cross-commit update scenarios. We evaluate MuMuTestUp against state-of-the-art baselines using both an open-source LLM (DeepSeek-V3.2) and a closed-source LLM (GPT-4.1).
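The third limitation can be illustrated with a small sketch of why exact-matching retrieval fails on a hallucinated query while a similarity-based lookup can still recover the intended context. This is an assumption-laden toy, not the paper's Semantic Retrieval agent: the index contents and function names are hypothetical, and character-level similarity from Python's standard `difflib` stands in for whatever semantic-similarity search the framework actually uses.

```python
from difflib import get_close_matches

# Hypothetical index mapping qualified method names to their signatures.
INDEX = {
    "OrderService.computeTotal": "double computeTotal(Order order)",
    "OrderService.applyDiscount": "double applyDiscount(Order order, double rate)",
    "UserRepository.findById": "Optional<User> findById(long id)",
}


def exact_lookup(query, index):
    # Exact matching: returns None unless the query is a key in the index.
    return index.get(query)


def similarity_lookup(query, index, cutoff=0.6):
    # Similarity matching (stand-in for semantic search): returns the closest
    # key whose similarity ratio is at least `cutoff`, or None if none qualifies.
    matches = get_close_matches(query, list(index), n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # The model asks for a method that does not exist under this name.
    hallucinated = "OrderService.calculateTotal"
    print("exact:", exact_lookup(hallucinated, INDEX))
    print("similar:", similarity_lookup(hallucinated, INDEX))
```

A real system would more likely compare embeddings of code and documentation than raw identifier strings; the point here is only that a tolerant match degrades gracefully where an exact lookup returns nothing.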