Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

πŸ“… 2025-06-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Problem: Existing automated unit test refactoring by LLMs lacks reliable, multidimensional evaluation metrics. Method: This paper proposes CTSES, the first composite evaluation framework that jointly models behavioral preservation, semantic consistency, readability, and structural validity. CTSES integrates CodeBLEU, METEOR, and ROUGE-L to enable joint quantification of renaming, structural reorganization, and semantic equivalence. Large-scale experiments are conducted on the Defects4J and SF110 Java benchmarks using GPT-4o and Mistral-Large-2407, enhanced with Chain-of-Thought prompting. Results: Evaluated across 5,000+ test suites, CTSES significantly outperforms prior metrics (p < 0.01) and achieves high agreement with developer judgments and human evaluations (Spearman’s ρ = 0.89). The framework substantially enhances the trustworthiness and practical utility of LLM-driven test refactoring.

πŸ“ Abstract
Large Language Models (LLMs) are increasingly employed to automatically refactor unit tests, aiming to enhance readability, naming, and structural clarity while preserving functional behavior. However, evaluating such refactorings remains challenging: traditional metrics like CodeBLEU are overly sensitive to renaming and structural edits, whereas embedding-based similarities capture semantics but ignore readability and modularity. We introduce CTSES, a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior preservation, lexical quality, and structural alignment. CTSES is evaluated on over 5,000 test suites automatically refactored by GPT-4o and Mistral-Large-2407, using Chain-of-Thought prompting, across two established Java benchmarks: Defects4J and SF110. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based test refactorings with structural and semantic awareness
Balancing behavior preservation, lexical quality, and structural alignment
Improving assessment fidelity and interpretability for refactored unit tests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composite metric CTSES balances behavior and structure
Integrates CodeBLEU, METEOR, and ROUGE-L metrics
Evaluated on GPT-4o and Mistral-Large-2407 refactorings
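The bullets above describe CTSES as a composite of CodeBLEU, METEOR, and ROUGE-L. As a minimal sketch of how such a composite might be aggregated, the snippet below averages the three component scores with configurable weights. The uniform weighting, the `composite_score` name, and the [0, 1] normalization are illustrative assumptions, not the paper's actual formula:

```python
# Illustrative sketch of a composite similarity score combining three
# component metrics. Weights and normalization are assumptions; the
# paper's actual CTSES aggregation is not reproduced here.

def composite_score(codebleu: float, meteor: float, rouge_l: float,
                    weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of three [0, 1]-scaled similarity metrics."""
    scores = (codebleu, meteor, rouge_l)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("component scores must be in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))

# Example: a refactoring that renames identifiers heavily but preserves
# structure may score low on CodeBLEU yet high on METEOR and ROUGE-L;
# a composite smooths out that single-metric sensitivity.
print(round(composite_score(0.55, 0.82, 0.78), 3))
```

The motivation for combining metrics, per the abstract, is that CodeBLEU alone over-penalizes renaming while embedding-style similarity ignores readability; a weighted blend trades these failure modes off against each other.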
πŸ”Ž Similar Papers
No similar papers found.
Wendkuuni C. Ouédraogo
University of Luxembourg, Luxembourg
Yinghua Li
University of Luxembourg, Luxembourg
Xueqi Dang
SnT, University of Luxembourg
Machine Learning Testing · Software Engineering
Xin Zhou
Singapore Management University, Singapore
Anil Koyuncu
SnT, University of Luxembourg
Jacques Klein
University of Luxembourg / SnT
Computer Science · Software Engineering · Android Security · Software Security · Model-Driven Engineering
David Lo
Singapore Management University, Singapore
Tegawendé F. Bissyandé
University of Luxembourg, Luxembourg