Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability and performance degradation of GRPO in deep search tasks, which the authors attribute to coarse-grained advantage assignment and an imbalance between positive and negative advantages. One consequence is that correct intermediate steps are penalized whenever the final answer is wrong. To mitigate these issues, the authors propose CalibAdv, an advantage calibration method designed for deep search. CalibAdv uses the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level, then rebalances positive and negative advantages in the answer component. Experiments across three models and seven benchmarks show that CalibAdv improves both question-answering performance and training stability, alleviating training collapse and degradation of natural language ability.

📝 Abstract

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
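
The coarse-grained assignment described in the abstract can be illustrated with a minimal sketch. The first function below is the commonly used group-relative advantage (each rollout's reward standardized within its group); the second is a purely hypothetical calibration step, assuming per-step correctness labels are available and using an arbitrary `scale` factor. The function names, the `scale` parameter, and the sample rewards are illustrative assumptions and are not taken from the CalibAdv code or paper. The rebalancing of advantages in the answer component is not sketched here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_step_advantages(advantage, step_correct, scale=0.5):
    """Hypothetical calibration (an assumption, not the authors' formula).

    Plain GRPO broadcasts one advantage to every step of a rollout.
    Here, a negative advantage is shrunk by `scale` on steps that are
    labeled correct, and left unchanged on steps labeled incorrect.
    Non-negative advantages are broadcast unchanged.
    """
    if advantage >= 0:
        return [advantage] * len(step_correct)
    return [advantage * scale if ok else advantage for ok in step_correct]


# One group of four rollouts; only the first reaches the right answer.
rewards = [1.0, 0.0, 0.0, 0.0]
advs = group_relative_advantages(rewards)

# Rollout 1 failed overall, but its first two steps were correct.
step_advs = calibrate_step_advantages(advs[1], [True, True, False])
print(advs)
print(step_advs)
```

In this toy group, the failed rollout receives a negative advantage on all three steps under plain broadcasting; the hypothetical calibration halves that penalty on the two correct steps and keeps it in full on the incorrect one.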

Problem

Research questions and friction points this paper is trying to address.

deep search

GRPO

advantage calibration

training instability

reward mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

advantage calibration

deep search

GRPO