AIPO: Learning to Reason from Active Interaction

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of existing reinforcement learning methods for LLM reasoning: exploration is constrained by the intrinsic capability of the policy model, and methods that rely on complete expert trajectories are sample-inefficient and provide only sparse feedback. The authors propose AIPO, a framework in which the policy model can consult three collaborative agents (a Verify Agent, a Knowledge Agent, and a Reasoning Agent) during training to receive fine-grained, targeted feedback when it hits a reasoning bottleneck. At inference time the policy model reasons on its own, without the agents. AIPO provides this guidance through active multi-agent interaction during exploration, and combines a tailored importance sampling coefficient with a clipping strategy to mitigate the off-policy bias and vanishing gradients that arise when learning from agent-provided feedback. The approach consistently improves performance on AIME, MATH500, GPQA-Diamond, and LiveCodeBench, generalizes across policy models and RLVR algorithms, and expands the reasoning capability boundary of the policy model.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, Verify Agent, Knowledge Agent, and Reasoning Agent, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
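The abstract does not give the form of AIPO's tailored importance sampling coefficient or its clipping rule. As a rough point of reference only, the sketch below shows the standard PPO-style clipped importance ratio for a single token, which is the generic technique this kind of correction usually builds on. The function name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions, not taken from the paper.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Generic PPO-style clipped objective for one token (to be maximized).

    NOTE: this is the textbook PPO form, used here as an assumption for
    illustration. It is NOT AIPO's tailored coefficient, which the
    abstract does not specify.
    """
    # Importance sampling ratio between the current policy and the
    # policy (or other source) that generated the token.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to the interval [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic bound: take the smaller of the two surrogate values.
    return min(ratio * advantage, clipped * advantage)
```

In this generic form, whenever the clipped term is the one selected, the objective no longer depends on the ratio, so that token contributes zero gradient. Tokens produced by an external source can have ratios far from 1, which is one plausible reading of the "gradient vanishing" issue the abstract mentions.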
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
reasoning capability
exploration limitation
trajectory-level guidance
capability boundary
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Multi-Agent Interaction
Reinforcement Learning with Verifiable Rewards
Capability Boundary Expansion
Fine-Grained Guidance
Off-Policy Bias Mitigation