TDFlow: Agentic Workflows for Test Driven Software Engineering

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of manual test interpretation and low repair success rates in repository-scale program repair, this paper proposes TDFlow: a multi-agent workflow for test-driven software engineering. TDFlow decomposes the repair task into four specialized subtasks—patch generation, debugging, revision, and optional test generation—executed iteratively by coordinated, large language model–driven agents constrained by a curated toolchain. This decoupling significantly reduces contextual overhead and enhances precision on each subtask. Evaluated on the SWE-Bench Lite and SWE-Bench Verified benchmarks, TDFlow achieves pass rates of 88.8% (a 27.8% absolute improvement over the next best system) and 94.3%, respectively. Manual validation confirms an exceptionally low test-deception rate, demonstrating TDFlow's effectiveness and robustness under realistic, human-curated test scenarios.

📝 Abstract
We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes software engineering program repair into four components governed by respective sub-agents. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces the long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains an 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best system) and 94.3% on SWE-Bench Verified. Manual inspection of the 800 TDFlow runs within SWE-Bench Lite and Verified uncovers only 7 instances of test hacking, which were subsequently counted as failures. Furthermore, we show that the primary obstacle to human-level software engineering performance lies in writing successful reproduction tests. We envision a human-LLM interactive system powered by TDFlow where human developers write tests solved by LLM systems. Together, these results indicate that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution -- with the final frontier for fully autonomous repository repair being the accurate generation of valid reproduction tests.
Problem

Research questions and friction points this paper is trying to address.

Solving repository-scale software engineering via test-driven workflows
Automating patch proposal and debugging using specialized sub-agents
Addressing human-level performance gaps in reproduction test generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic workflow frames software engineering as test-resolution
Decomposes repair into four specialized sub-agent components
Uses tightly constrained tools to propose, revise, and debug patches
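The propose/debug/revise loop over human-written tests described in the abstract can be sketched as follows. This is a minimal illustration inferred from the paper's high-level description only: all function names, the stubbed sub-agent bodies, and the iteration budget are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of TDFlow's decomposed, test-driven repair loop.
# Each sub-agent (proposer, debugger, reviser) is stubbed; in TDFlow these
# would be LLM-driven agents with tightly constrained tools.

from dataclasses import dataclass

@dataclass
class Patch:
    diff: str  # repository-scale patch content

def propose_patch(tests):
    # Patch-proposer sub-agent (stub): drafts an initial patch from the tests.
    return Patch(diff="initial attempt")

def run_tests(patch, tests):
    # Constrained test-runner tool (stub): tests are predicates over a patch;
    # return the subset that fail.
    return [t for t in tests if not t(patch)]

def debug(patch, failures):
    # Debugger sub-agent (stub): localizes failures into a report.
    return {"num_failing": len(failures)}

def revise(patch, report):
    # Patch-revision sub-agent (stub): edits the patch using the debug report.
    return Patch(diff=patch.diff + " (revised)")

def tdflow(tests, max_iters=5):
    """Iterate propose -> run tests -> debug -> revise until tests pass."""
    patch = propose_patch(tests)
    for _ in range(max_iters):
        failures = run_tests(patch, tests)
        if not failures:
            return patch  # all human-written tests resolved
        report = debug(patch, failures)
        patch = revise(patch, report)
    return None  # iteration budget exhausted
```

The decoupling matters because each stub would, in the real system, receive only the context its sub-task needs (failing-test output for the debugger, a debug report for the reviser), rather than the full interaction history.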
Kevin Han
Carnegie Mellon University

Siddharth Maddikayala
UC San Diego

Tim Knappe
Carnegie Mellon University

Om Patel
Children's Hospital of Philadelphia
Machine Learning and Biology

Austen Liao
Johns Hopkins University

Amir Barati Farimani
Russell V. Trader Associate Professor at Carnegie Mellon University
Computational systems, Multi-scale modeling, Biophysics, Deep Learning