🤖 AI Summary
This work addresses a reliability gap in agentic search driven by reinforcement learning (RL): large language models (LLMs) tend to produce plausible but unreliable answers when a question lies beyond their knowledge or reasoning boundaries, because they do not recognize those limits. To mitigate this, the authors propose Boundary-Aware Policy Optimization (BAPO), an RL framework that builds boundary awareness into the agent’s search policy. BAPO combines two components: a group-based boundary-aware reward, which encourages an “I don’t know” (IDK) response only when the model’s reasoning reaches its limit, and an adaptive reward modulator, which suspends that reward during early exploration so that IDK is not exploited as a shortcut. Experiments on four benchmarks indicate that BAPO substantially improves the reliability of agentic search without compromising accuracy.
📝 Abstract
RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
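The abstract does not specify how the two components are computed, so the following is only an illustrative sketch of how they could interact, not the authors' implementation. It assumes a group of rollouts sampled for the same question, treats "reasoning reaches its limit" as "no rollout in the group is correct", and models the modulator as a fixed warm-up period. The names `Rollout`, `boundary_aware_rewards`, `warmup_steps`, and `idk_bonus`, along with all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory for a question (hypothetical structure)."""
    answer: str
    is_correct: bool
    is_idk: bool


def boundary_aware_rewards(
    group: List[Rollout],
    step: int,
    warmup_steps: int = 100,   # assumed: IDK reward is suspended before this step
    idk_bonus: float = 0.5,    # assumed: partial credit for a justified IDK
) -> List[float]:
    """Assign one reward per rollout in a group sampled for the same question."""
    # Modulator (assumed form): no IDK reward during early exploration.
    idk_enabled = step >= warmup_steps
    # Boundary proxy (assumed form): no rollout in the group found the answer.
    at_boundary = not any(r.is_correct for r in group)

    rewards = []
    for r in group:
        if r.is_correct:
            rewards.append(1.0)
        elif r.is_idk and idk_enabled and at_boundary:
            rewards.append(idk_bonus)
        else:
            # Wrong answers, and IDK used while the question is still solvable.
            rewards.append(0.0)
    return rewards
```

Under these assumptions, an IDK rollout earns nothing when a sibling rollout answers correctly or when training is still in the warm-up period, which is one way to keep IDK from becoming a shortcut.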