A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty large language models have in using execution feedback to iteratively improve solutions in competitive programming. The authors propose A-ProS, an autonomous agent built on a hybrid multi-model feedback framework that separates code generation from debugging and keeps persistent (stateful) context across refinement rounds. Six workflows pair GPT-4 or GPT-5 as the generator with Codestral-2508, Llama-3.3-70B, or DeepSeek-R1 as the critic, and are evaluated on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). After three refinement rounds, the GPT-5 workflows rise from 39 initially accepted solutions to 85-90, and the GPT-4 workflows from 15 to 31-38. In a controlled ablation on 47 problems, stateful refinement beats stateless refinement by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared with baseline agent loops, A-ProS achieves more than 2x greater gains, which the authors take as evidence that persistent context and multi-model feedback matter for reliable autonomous program synthesis.
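The reported figures are easier to compare as percentages. The short sketch below converts them, assuming (this is an assumption, not stated explicitly in the source) that each figure is a count of accepted solutions out of the full set of 367 problems. The dictionary labels and the `pass_rate` helper are illustrative names.

```python
TOTAL_PROBLEMS = 367  # size of the evaluation set reported in the abstract

# Accepted-solution counts as reported; the "low"/"high" entries are the
# ends of the ranges given across the three critic models.
accepted = {
    "GPT-5 initial": 39,
    "GPT-5 after 3 rounds (low)": 85,
    "GPT-5 after 3 rounds (high)": 90,
    "GPT-4 initial": 15,
    "GPT-4 after 3 rounds (low)": 31,
    "GPT-4 after 3 rounds (high)": 38,
}


def pass_rate(count, total=TOTAL_PROBLEMS):
    """Return the share of problems accepted, as a percentage."""
    return 100.0 * count / total


# Assumption: every count uses all 367 problems as its denominator.
rates = {label: round(pass_rate(n), 1) for label, n in accepted.items()}

for label, rate in rates.items():
    print(f"{label}: {rate}%")
```

Under that assumption, the GPT-5 workflows move from roughly 10.6% to between 23.2% and 24.5% of problems accepted, and the GPT-4 workflows from roughly 4.1% to between 8.4% and 10.4%.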

📝 Abstract

Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.
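To make the 2 x 3 design and the stateful versus stateless distinction concrete, here is a minimal sketch. It is not the authors' implementation: the `Attempt` record, the `refine` function, the callable signatures, and the toy stubs are all hypothetical, and real generator, critic, and judge calls would replace the stubs.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

GENERATORS = ["GPT-4", "GPT-5"]
CRITICS = ["Codestral-2508", "Llama-3.3-70B", "DeepSeek-R1"]

# 2 x 3 factorial design: every generator paired with every critic.
WORKFLOWS = list(product(GENERATORS, CRITICS))


@dataclass
class Attempt:
    """One failed submission together with the critic's feedback on it."""
    code: str
    verdict: str
    critique: str


def refine(problem, generate, critique, judge, rounds=3, stateful=True):
    """Hypothetical generate -> judge -> critique loop.

    Returns (accepted_code_or_None, history of failed attempts).
    """
    history: List[Attempt] = []
    code = generate(problem, [])  # initial solution, no prior context
    for _ in range(rounds):
        verdict = judge(code)
        if verdict == "AC":
            return code, history
        # Assumption: stateful mode shows the critic all earlier attempts.
        feedback = critique(problem, code, verdict, history if stateful else [])
        history.append(Attempt(code, verdict, feedback))
        # Stateful: the generator sees every attempt; stateless: only the latest.
        code = generate(problem, history if stateful else history[-1:])
    return (code if judge(code) == "AC" else None), history


# Toy stubs (illustration only): the "solution" depends on how much
# context the generator receives, and only version "v2" is accepted.
def toy_generate(problem, context):
    return f"v{len(context)}"


def toy_critique(problem, code, verdict, context):
    return f"{code} got {verdict}"


def toy_judge(code):
    return "AC" if code == "v2" else "WA"


stateful_result, stateful_log = refine("p", toy_generate, toy_critique, toy_judge)
stateless_result, stateless_log = refine(
    "p", toy_generate, toy_critique, toy_judge, stateful=False
)
print(len(WORKFLOWS), stateful_result, stateless_result)
```

In this toy setup the stateless loop keeps resubmitting the same failing version because it never accumulates context. That mirrors the kind of repeated failure the ablation measures, but the stubs are contrived and carry no evidential weight.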

Problem

Research questions and friction points this paper is trying to address.

autonomous programming

code generation

execution feedback

competitive programming

program synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-model feedback

autonomous programming

stateful refinement