Mind the Gap: A Readability-Aware Metric for Test Code Complexity

📅 2025-06-07
📈 Citations: 0
✹ Influential: 0
📄 PDF
🤖 AI Summary
Existing code complexity metrics (e.g., cyclomatic and cognitive complexity) are designed for functional code and fail to accurately assess the readability and structural quality of automatically generated unit tests—particularly those produced by LLMs—often yielding distorted, unrealistically low scores. Method: We propose CCTR (Cognitive Complexity for Test code), the first cognitive complexity metric tailored specifically to test code. CCTR uniquely integrates test-specific structural and semantic features—including assertion density, annotation semantic roles, and test composition patterns—via static analysis and semantic parsing. Contribution/Results: Evaluated on 15,750 test suites from Defects4J and SF110 across EvoSuite, GPT-4o, and Mistral, CCTR robustly distinguishes well-structured from fragmented tests and achieves strong correlation (ρ = 0.89) with developer-perceived effort. All data, prompts, and evaluation scripts are publicly released.
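As a rough illustration of one ingredient the summary names, assertion density, here is a minimal sketch that counts JUnit-style `assert*` calls per non-blank line of a Java test method. This is a hypothetical simplification written for this summary, not the paper's CCTR formula, which combines several features via static analysis and semantic parsing; the regex and the density definition are assumptions.

```python
import re

# Assumption: "assertion density" = assert* calls / non-blank source lines.
# The real CCTR metric is richer; this only illustrates the idea.
ASSERT_RE = re.compile(r"\bassert\w*\s*\(")

def assertion_density(test_source: str) -> float:
    """Toy assertion density for a Java test method given as a string."""
    lines = [ln for ln in test_source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    assertions = sum(len(ASSERT_RE.findall(ln)) for ln in lines)
    return assertions / len(lines)

example = """
@Test
public void testAdd() {
    Calculator c = new Calculator();
    assertEquals(4, c.add(2, 2));
    assertTrue(c.add(1, 1) > 0);
}
"""
print(round(assertion_density(example), 2))  # 2 assertions over 6 lines
```

A fragmented EvoSuite-style test with many setup lines and a single trailing assertion would score much lower than a compact, assertion-heavy test, which is the kind of distinction the summary says CCTR is built to surface.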

📝 Abstract
Automatically generated unit tests, whether from search-based tools like EvoSuite or from LLMs, vary significantly in structure and readability. Yet most evaluations rely on metrics like Cyclomatic Complexity and Cognitive Complexity, which were designed for functional code rather than test code. Recent studies have shown that SonarSource's Cognitive Complexity metric assigns near-zero scores to LLM-generated tests, yet its behavior on EvoSuite-generated tests and its applicability to test-specific code structures remain unexplored. We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. CCTR integrates structural and semantic features such as assertion density, annotation roles, and test composition patterns, dimensions ignored by traditional complexity models but critical for understanding test code. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110. Results show CCTR effectively discriminates between structured and fragmented test suites, producing interpretable scores that better reflect developer-perceived effort. By bridging structural analysis and test readability, CCTR provides a foundation for more reliable evaluation and improvement of generated tests. We publicly release all data, prompts, and evaluation scripts to support replication.
Problem

Research questions and friction points this paper is trying to address.

Existing metrics fail to evaluate test code complexity accurately
Current complexity models ignore test-specific structural and semantic features
Lack of a reliable metric for assessing the readability of generated unit tests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CCTR for test code complexity
Integrates assertion density, annotation roles, and test composition patterns
Evaluates CCTR on 15,750 test suites from Defects4J and SF110