AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

📅 2026-04-13

🤖 AI Summary

This work addresses two gaps: code generated by current large language models often cannot be verified and may not run in practice, and existing agent systems offer weak correctness guarantees because they simulate execution or treat testing as optional. The authors propose a multi-agent framework that separates five roles (planning, coding, testing, debugging, and critique) and coordinates them through shared memory. Its key innovation is mandatory validation: every code modification must be executed in a Docker sandbox before it propagates, which makes real execution feedback a core mechanism of the multi-agent software engineering pipeline and enables closed-loop, iterative decision-making. On SWE-Bench Lite the approach achieves a 40.0% resolution rate, outperforming single-agent baselines by 26–28 percentage points. Ablation studies indicate that execution-based verification and role decomposition each contribute to performance.

📝 Abstract

Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AgentForge, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AgentForge achieves 40.0% resolution on SWE-Bench Lite, outperforming single-agent baselines by 26–28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open-source at https://github.com/raja21068/AutoCodeAI.
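The gating rule in the abstract (a change propagates only after it survives execution) can be illustrated with a small sketch. This is not the AgentForge implementation: `SharedMemory`, `ExecResult`, `solve`, `docker_test_command`, `fake_propose`, and `fake_execute` are hypothetical names introduced here, the LLM agents are replaced by a scripted stand-in, and the sandbox is replaced by an in-process check so the example runs without Docker. The `docker run` flags shown are standard ones, but how the paper actually configures its sandbox is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical types for illustration; none of these names come from the paper.
Repo = Dict[str, str]  # file path -> file contents


@dataclass
class ExecResult:
    passed: bool
    log: str


@dataclass
class SharedMemory:
    """Stand-in for the shared memory the agents coordinate through."""
    repo: Repo
    history: List[str] = field(default_factory=list)


def docker_test_command(image: str, repo_dir: str, test_cmd: List[str]) -> List[str]:
    """Build (but do not run) a `docker run` call that executes tests offline.

    Assumption: the real framework may configure its sandbox differently.
    """
    return ["docker", "run", "--rm", "--network", "none",
            "-v", f"{repo_dir}:/repo", "-w", "/repo", image, *test_cmd]


def solve(memory: SharedMemory,
          propose: Callable[[Repo, Optional[str]], Repo],
          execute: Callable[[Repo], ExecResult],
          max_iters: int = 3) -> bool:
    """Iterate over repository states; only executed-and-passing states propagate."""
    feedback: Optional[str] = None
    for step in range(max_iters):
        candidate = propose(memory.repo, feedback)  # Coder/Debugger role
        result = execute(candidate)                 # Tester role, inside the sandbox
        memory.history.append(f"step {step}: passed={result.passed}")
        if result.passed:
            memory.repo = candidate                 # propagate only after execution succeeds
            return True
        feedback = result.log                       # failed candidate is discarded; its log is kept
    return False


def fake_execute(candidate: Repo) -> ExecResult:
    """In-process stand-in for the sandbox: runs the code and checks add(2, 3) == 5."""
    namespace: dict = {}
    try:
        exec(candidate["calc.py"], namespace)
        ok = namespace["add"](2, 3) == 5
        return ExecResult(ok, "ok" if ok else "add(2, 3) returned the wrong value")
    except Exception as exc:
        return ExecResult(False, repr(exc))


def fake_propose(repo: Repo, feedback: Optional[str]) -> Repo:
    """Scripted stand-in for an LLM: buggy first attempt, corrected retry after feedback."""
    if feedback is None:
        body = "def add(a, b):\n    return a - b\n"
    else:
        body = "def add(a, b):\n    return a + b\n"
    return {**repo, "calc.py": body}


if __name__ == "__main__":
    memory = SharedMemory(repo={"calc.py": ""})
    print(solve(memory, fake_propose, fake_execute))
    print(memory.history)
```

In this sketch the shared repository state changes only on the passing iteration, so a failing patch never reaches later steps; the execution log is the only thing a failed attempt contributes. In a real deployment `execute` would run the command built by `docker_test_command` (for example via `subprocess.run`) and parse the test results.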

Problem

Research questions and friction points this paper is trying to address.

code verification

large language models

multi-agent systems

software engineering

execution feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

execution-grounded verification

multi-agent LLM framework

sandboxed execution