AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

๐Ÿ“… 2025-12-22
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Existing tool-use methods rely solely on sparse, verifiable outcome-based rewards and neglect explicit modeling of the reasoning process, leaving LLMs with weak reasoning capabilities and insufficient reward signals. To address this, we propose an Advantage-Weighted Policy Optimization (AWPO) framework that introduces, for the first time, a variance-aware gating mechanism and a difficulty-aware weighting scheme to adaptively fuse explicit reasoning rewards with outcome rewards. We further design customized advantage clipping to mitigate objective conflicts and ensure training stability. Built upon policy-gradient optimization and dynamic multi-stage reward integration, our method achieves state-of-the-art performance on standard tool-use benchmarks: a 4B-parameter model attains a 16.0% absolute improvement in multi-turn accuracy over Grok-4 while maintaining strong generalization on MMLU-Pro.

๐Ÿ“ Abstract
While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
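The listing does not include code, but the fusion the abstract describes is concrete enough to sketch. The snippet below is a minimal, assumption-laden illustration of group-relative advantage fusion: the gate 1/(1 + std), the difficulty proxy 1 − mean outcome reward, and the symmetric clip range are illustrative stand-ins, not the authors' exact formulation.

```python
import numpy as np

def fused_advantages(outcome_rewards, reasoning_rewards, clip_range=1.0):
    """Sketch of AWPO-style advantage fusion for one rollout group.

    outcome_rewards, reasoning_rewards: arrays of shape (group_size,)
    holding the verifiable outcome reward and the explicit reasoning
    reward for each sampled trajectory. All specifics below are
    illustrative assumptions, not the paper's formulation.
    """
    o = np.asarray(outcome_rewards, dtype=float)
    r = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative (GRPO-style) advantage from the outcome reward.
    adv_outcome = (o - o.mean()) / (o.std() + 1e-8)

    # Variance-aware gate (assumed form): when outcome rewards barely
    # vary within the group (all rollouts succeed or all fail), the
    # outcome advantage is uninformative, so the reasoning signal
    # contributes more.
    gate = 1.0 / (1.0 + o.std())  # in (0, 1]

    # Difficulty-aware weight (assumed proxy): the group's mean success
    # rate stands in for prompt difficulty; harder prompts get a larger
    # reasoning weight. Assumes outcome rewards lie in [0, 1].
    difficulty = 1.0 - o.mean()

    adv_reasoning = (r - r.mean()) / (r.std() + 1e-8)

    # Clip the reasoning-derived advantage so it cannot overturn the
    # outcome objective (in the spirit of the paper's customized
    # advantage clipping).
    adv_reasoning = np.clip(gate * difficulty * adv_reasoning,
                            -clip_range, clip_range)

    return adv_outcome + adv_reasoning
```

Note that in this sketch the outcome advantage is left untouched and only the reasoning-derived term is gated, weighted, and clipped, matching the stated goal of keeping the outcome objective primary.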
Problem

Research questions and friction points this paper is trying to address.

Integrates reasoning rewards to enhance tool-use in LLMs
Addresses suboptimal performance from combining reasoning and outcome rewards
Improves multi-turn accuracy and generalization in tool-use benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage-weighted policy optimization integrates reasoning rewards
Variance-aware gating adaptively modulates reasoning advantages
Difficulty-aware weighting and clipping ensure stable optimization (see the sketch after this list)
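Since AWPO builds on policy-gradient optimization, a fused per-trajectory advantage of the kind sketched earlier would typically enter a clipped surrogate objective. The snippet below is the standard PPO/GRPO-style surrogate, not the paper's exact loss; the tensor shapes and the eps value are assumptions.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard clipped policy-gradient surrogate (PPO/GRPO style).

    Assumed shapes: logp_new and logp_old are (batch, seq_len)
    per-token log-probabilities; advantages is a (batch,) vector of
    per-trajectory fused advantages broadcast over tokens.
    """
    ratio = torch.exp(logp_new - logp_old)           # importance ratio
    adv = advantages.unsqueeze(-1)                   # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Maximize the surrogate, i.e. minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```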
Zihan Lin
Researcher, Xiaohongshu
Recommender System
Xiaohan Wang
Meituan
Hexiong Yang
Institute of Automation, Chinese Academy of Sciences
Jiajun Chai
Meituan Inc.
Reinforcement Learning, LLMs, Agentic Learning
Jie Cao
Institute of Automation, Chinese Academy of Sciences
Guojun Yin
Meituan, University of Science and Technology of China
Multimodality, Computer Vision, Foundation Models, Deep Learning, Image/Video Processing
Wei Lin
Meituan
Ran He
Institute of Automation, Chinese Academy of Sciences