🤖 AI Summary
Existing search-based software testing (SBST) and large language model (LLM)-based approaches struggle to cover hard-to-test branches that involve complex object construction and inter-procedural dependencies. To address this, we propose a semantics-aware, LLM-based test generation method that integrates static program analysis with feedback-driven prompt engineering. The approach introduces a novel three-stage mechanism: (1) extraction of realistic invocation contexts, (2) identification of inter-procedural dependencies, and (3) counterexample-guided iterative refinement, enabling LLMs to precisely capture the target method's semantics and constraints. By jointly modeling control and data flow, feeding counterexamples back to the model, and dynamically reconstructing prompts, the method deliberately triggers hard-to-cover branches. Evaluated on 27 open-source Python projects, it improves average branch coverage by 31.39% over SBST baselines and by 22.22% over LLM baselines, significantly enhancing the testability of complex execution paths.
📝 Abstract
Automatic test generation plays a critical role in software quality assurance. While recent advances in Search-Based Software Testing (SBST) and Large Language Models (LLMs) have shown promise in generating useful tests, these techniques still struggle to cover certain branches. Reaching these hard-to-cover branches usually requires constructing complex objects and resolving intricate inter-procedural dependencies in branch conditions, which poses significant challenges for existing test generation techniques. In this work, we propose TELPA, a novel technique aimed at addressing these challenges. Its key insight lies in extracting real usage scenarios of the target method under test to learn how to construct complex objects, and extracting methods that entail inter-procedural dependencies with hard-to-cover branches to learn the semantics of branch constraints. To enhance efficiency and effectiveness, TELPA identifies a set of ineffective tests as counter-examples for LLMs and employs a feedback-based process to iteratively refine these counter-examples. TELPA then integrates the program analysis results and counter-examples into the prompt, guiding LLMs to gain a deeper understanding of the target method's semantics and to generate diverse tests that can reach the hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms state-of-the-art SBST and LLM-based techniques, achieving average improvements of 31.39% and 22.22% in branch coverage, respectively.
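The feedback-based process described above can be sketched as a simple loop: build a prompt from the extracted usage scenarios, dependency methods, and accumulated counter-examples; ask the LLM for candidate tests; keep any that reach the target branch, and feed the ineffective ones back as counter-examples. This is a minimal illustrative sketch, not TELPA's actual implementation; all names (`build_prompt`, `refine_tests`, `llm_generate`, `covers_target`) are hypothetical stand-ins.

```python
def build_prompt(usage_context, dependencies, counter_examples):
    """Assemble a prompt from real usage scenarios of the target method,
    methods with inter-procedural dependencies, and previously generated
    tests that failed to reach the hard-to-cover branch."""
    parts = [
        "Usage scenarios of the target method:\n" + usage_context,
        "Methods with inter-procedural dependencies:\n" + dependencies,
    ]
    if counter_examples:
        parts.append(
            "Counter-examples (tests that do NOT reach the target branch; "
            "generate different ones):\n" + "\n".join(counter_examples)
        )
    return "\n\n".join(parts)


def refine_tests(usage_context, dependencies, llm_generate, covers_target,
                 max_rounds=3):
    """Iteratively query the LLM, accumulating ineffective tests as
    counter-examples, until some generated test covers the target branch."""
    counter_examples = []
    for _ in range(max_rounds):
        prompt = build_prompt(usage_context, dependencies, counter_examples)
        candidates = llm_generate(prompt)          # e.g. an LLM API call
        effective = [t for t in candidates if covers_target(t)]
        if effective:
            return effective, counter_examples
        counter_examples.extend(candidates)        # feed back as negatives
    return [], counter_examples
```

In practice, `covers_target` would execute each candidate test under a branch-coverage tracer (e.g. coverage.py) and check whether the hard-to-cover branch was taken; here it is left abstract.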