Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing large language model (LLM)-based test generation methods, which often rely on greedy strategies and struggle to cover deep code branches requiring multi-step setup. The paper introduces CovQValue, the first approach to integrate Bayesian exploration into LLM-driven test generation. It treats the program’s branch structure as an unknown environment and leverages coverage maps from evolutionary runs as a surrogate posterior. By prompting the LLM to generate diverse exploration plans in parallel and selecting the most informative paths based on LLM-estimated Q-values, CovQValue balances immediate bug discovery with long-term reachability. Experiments demonstrate that this method improves branch coverage by 51–77% on TestGenEval Lite, achieving win rates of 77–84%, and attains 40–74% coverage on the newly introduced RepoExploreBench, significantly outperforming baseline approaches.
📝 Abstract
The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program's branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, where CovQValue achieves 40-74% coverage. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction.
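The abstract's loop (feed the coverage map back to the LLM, sample diverse candidate plans in parallel, score each with an LLM-estimated Q-value, execute the best) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `llm_generate_plans` and `llm_q_value` are hypothetical stand-ins for the actual LLM calls, and the Q-value heuristic (immediate coverage gain plus a depth bonus as a proxy for future reachability) is an assumption of this sketch.

```python
import random

def llm_generate_plans(coverage_map, k):
    """Hypothetical stand-in for an LLM call: propose k candidate
    test plans, each a list of branch IDs the plan aims to hit."""
    uncovered = [b for b, hit in coverage_map.items() if not hit]
    return [random.sample(uncovered, min(2, len(uncovered)))
            for _ in range(k)]

def llm_q_value(plan, coverage_map):
    """Hypothetical stand-in for the LLM-estimated Q-value:
    immediate new-branch gain plus a small bonus for deeper
    (later-indexed) branches as a proxy for future reachability."""
    gain = sum(1 for b in plan if not coverage_map[b])
    depth_bonus = 0.1 * max(plan, default=0)
    return gain + depth_bonus

def covqvalue_loop(branches, rounds=5, k=4):
    """Iterative plan-select-execute loop over a branch set."""
    coverage_map = {b: False for b in branches}
    for _ in range(rounds):
        plans = [p for p in llm_generate_plans(coverage_map, k) if p]
        if not plans:
            break  # everything reachable is covered
        best = max(plans, key=lambda p: llm_q_value(p, coverage_map))
        for b in best:  # "execute" the chosen plan: mark branches hit
            coverage_map[b] = True
    return coverage_map
```

In the paper the environment is a real test harness and the posterior is the evolving coverage map from evolutionary runs; here both are reduced to a boolean dictionary to keep the control flow visible.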
Problem

Research questions and friction points this paper is trying to address.

LLM test generation
branch coverage
greedy strategy
code exploration
automated testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

curiosity-driven planning
Bayesian exploration
LLM-based test generation
coverage map
Q-value