Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting cross-package and cross-function errors in Python's ecosystem remains challenging due to the lack of scalable, semantics-aware automated testing methods. Method: This paper presents a framework that integrates a large language model (LLM) agent with property-based testing (PBT). The agent parses source code and documentation to infer function-specific and cross-function properties, synthesizes and executes randomized PBT suites, and reflects on test outputs to filter out false positives; a priority-ranking rubric then surfaces the most actionable reports. Contribution/Results: Evaluated on 100 widely used Python packages, 56% of the agent's bug reports were valid after manual review, and of the 21 top-ranked reports, 86% were genuine bugs. Five bugs, four with patches, were reported to projects including NumPy, with three patches merged. This work demonstrates end-to-end, LLM-driven automation of PBT for defect discovery, strengthening quality assurance for open-source software.

📝 Abstract
Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, e.g. by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cross-function properties from code and documentation, synthesizes and executes PBTs, reflects on the outputs of these tests to confirm true bugs, and finally outputs actionable bug reports for the developer. We perform an extensive evaluation of our agent across 100 popular Python packages. Of the bug reports generated by the agent, we found after manual review that 56% were valid bugs and 32% were valid bugs that we would report to maintainers. We then developed a ranking rubric to surface high-priority valid bugs to developers, and found that of the 21 top-scoring bugs, 86% were valid and 81% were ones we would report. The bugs span diverse failure modes, from serialization failures to numerical precision errors to flawed cache implementations. We reported 5 bugs, 4 with patches, including to NumPy and cloud computing SDKs, with 3 patches merged successfully. Our results suggest that combining LLMs with PBT provides a rigorous and scalable method for autonomously testing software. Our code and artifacts are available at: https://github.com/mmaaz-git/agentic-pbt.
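To make the PBT workflow the abstract describes concrete, here is a minimal hand-rolled sketch: a generator standing in for a framework's input-domain combinators (e.g. Hypothesis strategies), a property expressed as a test function (here, a generic JSON round-trip invariant, not a property from the paper), and a loop that searches for a counterexample by generating random inputs. The function names `gen_value` and `run_pbt` are illustrative, not from the paper or any library.

```python
import json
import random
import string


def gen_value(depth=0):
    # Input-domain generator: a stand-in for the combinators a PBT
    # framework supplies. Produces JSON-compatible values (None, bools,
    # ints, strings, and shallow lists/dicts with string keys).
    kinds = ["none", "bool", "int", "str"] + (["list", "dict"] if depth < 2 else [])
    kind = random.choice(kinds)
    if kind == "none":
        return None
    if kind == "bool":
        return random.choice([True, False])
    if kind == "int":
        return random.randint(-10**6, 10**6)
    if kind == "str":
        return "".join(random.choices(string.ascii_letters, k=4))
    if kind == "list":
        return [gen_value(depth + 1) for _ in range(random.randint(0, 3))]
    return {
        "".join(random.choices(string.ascii_letters, k=4)): gen_value(depth + 1)
        for _ in range(random.randint(0, 3))
    }


def round_trip_property(value):
    # Property under test: serializing then deserializing is the identity.
    return json.loads(json.dumps(value)) == value


def run_pbt(trials=200):
    # Search for a counterexample by generating random inputs and
    # checking the property, as a PBT framework does internally.
    for _ in range(trials):
        value = gen_value()
        if not round_trip_property(value):
            return value  # counterexample found
    return None  # no counterexample in this many trials
```

In practice one would use a framework such as Hypothesis, which adds crucial machinery this sketch omits: richer combinators, counterexample shrinking, and replay of failing inputs. The paper's agent automates the hard part, inferring which properties to test from code and documentation.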
Problem

Research questions and friction points this paper is trying to address.

Automating bug discovery in Python packages using LLM-driven property testing
Generating actionable bug reports by synthesizing and executing property tests
Validating testing effectiveness across 100 popular Python software packages
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agent autonomously generates property-based tests
Agent analyzes code to infer and validate software properties
System produces actionable bug reports with patches