Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI agents exhibit limited performance on complex tasks such as multimodal understanding, structured generation, and strategic planning, especially under black-box API settings where standard fine-tuning is infeasible. Existing inference-time methods (e.g., Best-of-N) lack iterative feedback mechanisms, hindering progressive refinement. To address this, the paper proposes Iterative Agent Decoding (IAD), a framework that integrates verifier-guided dynamic candidate evaluation with multi-round refinement, enabling joint optimization of sampling and verification. The key contribution is a verifier-driven iterative feedback paradigm that characterizes the critical role of verifier quality in inference-time optimization and examines scaling behavior under noisy or sparse reward conditions. On the Sketch2Code, Text2SQL, and WebShop benchmarks, IAD achieves absolute improvements of 3-6% on the first two and 8-10% on WebShop. Ablation studies confirm that the gains stem primarily from verifier-guided refinement, not from increased sampling diversity.
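The refine-evaluate-select loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `generate`/`verify` signatures and the feedback format are assumptions for the sake of the example.

```python
def iterative_agent_decoding(generate, verify, task, n_candidates=4, rounds=3):
    """Sketch of a verifier-guided iterative decoding loop.

    `generate(task, feedback)` returns one candidate solution (feedback is
    None on the first round); `verify(task, candidate)` returns a scalar
    reward. Both callables are hypothetical stand-ins for the agent and the
    verifier.
    """
    best, best_score = None, float("-inf")
    feedback = None
    for _ in range(rounds):
        # Sample several candidates, conditioning on verifier feedback so far.
        candidates = [generate(task, feedback) for _ in range(n_candidates)]
        scored = [(verify(task, c), c) for c in candidates]
        round_score, round_best = max(scored, key=lambda sc: sc[0])
        if round_score > best_score:
            best, best_score = round_best, round_score
        # Feed the best candidate and its score back for the next round,
        # so refinement is driven by the verifier rather than resampling alone.
        feedback = (round_best, round_score)
    return best, best_score
```

The key difference from plain Best-of-N is the `feedback` variable: each round conditions generation on the verifier's assessment of the previous round instead of drawing independent samples.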

📝 Abstract
While AI agents have shown remarkable performance at various tasks, they still struggle with complex multi-modal applications, structured generation, and strategic planning. Improvement via standard fine-tuning is often impractical, as solving agentic tasks usually relies on black-box API access without control over model parameters. Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. However, BON lacks an iterative feedback integration mechanism. Hence, we propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier. IAD differs in how feedback is designed and integrated, being specifically optimized to extract maximal signal from reward scores. We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and WebShop, where IAD consistently outperforms baselines, achieving 3--6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8--10% gains on WebShop across multiple metrics. To better understand the source of IAD's gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD's improvements are primarily driven by verifier-guided refinement, not merely sampling diversity. We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier. Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior. Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization.
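For contrast with the iterative approach, the Best-of-N baseline the abstract refers to amounts to independent sampling plus a single selection step. The sketch below is illustrative; `generate` and `verify` are hypothetical stand-ins for the agent and the verifier, not the paper's API.

```python
def best_of_n(generate, verify, task, n=8):
    """Best-of-N (BON) sampling: draw n independent candidates and keep
    the verifier's top pick. No feedback flows between samples, which is
    the limitation that iterative, verifier-guided refinement addresses.
    """
    candidates = [generate(task) for _ in range(n)]
    return max(candidates, key=lambda c: verify(task, c))
```

Because each candidate is drawn independently, additional compute only widens the search; it cannot steer later samples toward fixing the flaws the verifier found in earlier ones.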
Problem

Research questions and friction points this paper is trying to address.

Improving AI agent performance in complex multi-modal tasks
Enhancing iterative feedback integration in decoding methods
Optimizing verifier-guided dynamic evaluation and selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative refinement with dynamic evaluation
Verifier-guided feedback integration
Inference-time scaling with optimal verifier