🤖 AI Summary
This work addresses the degradation in reliability of large language models within long-sequence agent workflows caused by accumulating semantic ambiguities. To mitigate this, the authors propose a three-stage closed-loop framework that models multi-step reasoning as a noisy Markov decision process, enabling progressive denoising through perception, regulation, and correction. Key innovations include an uncertainty-aware adaptive computation allocation mechanism, an unsupervised online self-calibration method, and a synergistic integration of semantic uncertainty estimation, adaptive path exploration, impact-analysis-driven error correction, and verifier feedback alignment. Evaluated across six benchmarks, the approach achieves an average accuracy of 83.3%, outperforming the strongest baseline by 1.3%, while reducing computational overhead by 40–56% through dynamic branching.
📝 Abstract
Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural-language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing approaches to mitigating it often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single-path execution that ignores uncertainty entirely. We formalize the multi-step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages: (1) Sensing estimates per-step semantic uncertainty; (2) Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on estimated risk; and (3) Correcting performs targeted recovery via influence-based root-cause localization. Online self-calibration continuously aligns decision boundaries with verifier feedback, requiring no ground-truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40–56% through adaptive branching. Detailed ablation studies further confirm the framework's robustness and generality. Code is available at https://anonymous.4open.science/r/DenoiseFlow-21D3/.
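The Sensing→Regulating gate described above can be illustrated with a minimal sketch. Everything here is hypothetical: the function names (`estimate_uncertainty`, `route_step`), the entropy-based uncertainty proxy, and the threshold value are illustrative stand-ins, not DenoiseFlow's actual implementation.

```python
import math
from collections import Counter

def estimate_uncertainty(samples):
    """Normalized entropy over sampled candidate interpretations (0..1).

    A common proxy for semantic uncertainty: high agreement among
    samples yields low entropy, disagreement yields high entropy.
    """
    counts = Counter(samples)
    if len(counts) <= 1:
        return 0.0  # full agreement: no measurable uncertainty
    n = len(samples)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))  # normalize to [0, 1]

def route_step(samples, threshold=0.5):
    """Regulating stage: take the cheap single path when uncertainty is
    low; branch into parallel exploration when it exceeds the threshold."""
    u = estimate_uncertainty(samples)
    mode = "explore_parallel" if u >= threshold else "single_path"
    return mode, u

# Agreement among sampled interpretations -> cheap single-path execution.
print(route_step(["42", "42", "42", "42"])[0])  # single_path

# Disagreement -> allocate extra computation to parallel branches.
print(route_step(["42", "41", "7", "42"])[0])   # explore_parallel
```

This kind of uncertainty-gated routing is what allows the reported cost reduction: most steps are low-risk and take the single path, so the extra branching budget is spent only where the estimated risk is high.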