TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

📅 2026-02-11
🤖 AI Summary
This work addresses the limited capability of existing large language models (LLMs) in proactively discovering software defects, as current approaches focus primarily on regression prevention or fault reproduction. To bridge this gap, we introduce TestExplora, a novel benchmark that establishes the first evaluation paradigm for active defect discovery by LLMs. In this framework, defect signals are concealed within complete code repositories, and models must infer expected behaviors from documentation to generate test cases that expose implementation errors. The benchmark uses documentation as the source of intended behavior, incorporates cross-module interaction analysis, supports agent-driven exploration strategies, and features a time-aware data collection mechanism to prevent data leakage. Experimental results show that even the best-performing model achieves a Fail-to-Pass (F2P) rate of only 16.06%; instantiating SWE-Agent with GPT-5-mini improves F2P to 17.27% and F2P@5 to 29.7%, underscoring the promise of agentic exploration for proactive bug discovery.
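The Fail-to-Pass criterion above can be made concrete with a small sketch: a generated test exposes a hidden defect only if it fails on the buggy implementation and passes once the bug is patched, and F2P@k counts a task as solved if any of k sampled attempts meets that bar. The function and variable names below are illustrative assumptions, not identifiers from the TestExplora release.

```python
# Hedged sketch of F2P and F2P@k scoring. Each attempt is modeled as a
# (fails_on_buggy, passes_on_patched) pair of booleans; names are
# hypothetical, not taken from the TestExplora codebase.

def is_f2p(fails_on_buggy: bool, passes_on_patched: bool) -> bool:
    """A test exposes the bug iff it fails pre-patch and passes post-patch."""
    return fails_on_buggy and passes_on_patched

def f2p_at_k(attempts_per_task, k: int) -> float:
    """Fraction of tasks where any of the first k attempts is Fail-to-Pass."""
    hits = sum(
        any(is_f2p(f, p) for f, p in attempts[:k])
        for attempts in attempts_per_task
    )
    return hits / len(attempts_per_task)

# Toy example: 3 tasks, up to 2 generation attempts each.
tasks = [
    [(True, True), (False, True)],   # first attempt exposes the bug
    [(True, False), (False, True)],  # no attempt is Fail-to-Pass
    [(False, False), (True, True)],  # second attempt exposes the bug
]
print(f2p_at_k(tasks, 1))  # 1/3: only the first task succeeds at k=1
print(f2p_at_k(tasks, 2))  # 2/3: the third task is rescued by its 2nd try
```

Note the asymmetry this criterion enforces: a test that fails on both versions (a broken test) or passes on both (a vacuous one) earns no credit, which is what distinguishes proactive discovery from the "compliance trap" of treating existing code as ground truth.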

📝 Abstract
Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures occur. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWE-Agent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.
Problem

Research questions and friction points this paper is trying to address.

proactive bug discovery
LLM evaluation
software assurance
test generation
repository-level testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive bug discovery
repository-level test generation
LLM evaluation benchmark
agentic exploration
documentation-as-oracle