Learning to Generate Unit Tests for Automated Debugging

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In LLM-based automated debugging, unit test generation faces a fundamental tension: inputs that reveal errors in faulty code are hard to pair with correctly predicted expected outputs when no gold solution is available. Method: We propose UTGen, which teaches LLMs to jointly generate error-revealing test inputs and their correct expected outputs from task descriptions and candidate code. UTGen is integrated into the UTDebug framework, which (i) scales test-time compute to improve test output prediction and (ii) validates and back-tracks edits against multiple generated tests to avoid overfitting to noisy feedback. Contributions/Results: UTGen outperforms test-generation baselines by 7.59% on a metric requiring both error-revealing inputs and correct outputs. Within UTDebug, its tests improve pass@1 of Qwen-2.5 7B by over 3% on HumanEvalFix and 12.35% on a harder MBPP+ debugging split, advancing LLMs' autonomous debugging capability.

📝 Abstract
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and back-tracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3% and 12.35% (respectively) over other LLM-based UT generation baselines.
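The two safeguards the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: `predict_output` stands in for test-time scaling (sample several candidate outputs per test input and keep the majority vote), and `utdebug_round` stands in for the validate-and-backtrack loop (an edit is kept only if it passes at least as many generated tests as the current code). The callables `run_test` and `propose_fix` are hypothetical stand-ins for the test harness and the LLM debugger.

```python
from collections import Counter

def predict_output(candidate_outputs):
    # Test-time scaling: take the majority vote over sampled outputs
    # as the expected output for a generated test input.
    return Counter(candidate_outputs).most_common(1)[0][0]

def utdebug_round(code, tests, run_test, propose_fix, max_rounds=3):
    """Iteratively debug `code` against generated (input, expected) tests.

    run_test(code, inp, expected) -> bool; propose_fix(code, tests) -> new code.
    """
    def passed(c):
        return sum(run_test(c, i, o) for i, o in tests)

    best, best_score = code, passed(code)
    for _ in range(max_rounds):
        fixed = propose_fix(best, tests)
        score = passed(fixed)
        # Back-tracking: reject edits that do not pass more generated
        # tests, so noisy tests cannot drag the code backwards.
        if score > best_score:
            best, best_score = fixed, score
    return best
```

As a toy usage, a buggy `lambda x: x + 2` checked against tests `[(3, 6), (0, 0)]` would be replaced by a proposed `lambda x: x * 2`, since the fix passes more of the generated tests.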
Problem

Research questions and friction points this paper is trying to address.

Automated Unit Testing
Large Language Models
Code Error Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

UTGen
unit test generation
LLM debugging enhancement