GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses coarse credit assignment in reinforcement learning for large language model (LLM) agents, where relying solely on trajectory-level rewards gives weak supervision for long-horizon decision-making. The authors propose GEAR, an adaptive-granularity credit assignment framework that uses teacher-student self-distillation to derive token- and segment-level signals. The method detects points where the on-policy student diverges from a ground-truth-conditioned teacher and uses them to partition each trajectory into credit regions: token-level resolution is preserved where the student stays aligned with the teacher, while the continuation after a divergence spike is grouped into a segment whose advantage is modulated by the divergence at the departure point. Built on GRPO and evaluated with Qwen3 4B and 8B models, the approach outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods across eight mathematical reasoning and agentic tool-use benchmarks, with gains of up to around 20% over GRPO on benchmarks where the GRPO baseline is low.
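For context, the coarse supervision described above can be illustrated with a minimal sketch of a GRPO-style trajectory-level advantage. It assumes the common formulation in which each rollout's reward is standardized within its sampling group; the use of the population standard deviation, the `eps` constant, and the function names are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's outcome reward within its group.

    Assumption: population std and a small eps for numerical stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Trajectory-level credit: every token inherits the same scalar."""
    return [advantage] * num_tokens


# Hypothetical group of four rollouts with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)

# All five tokens of the first rollout receive an identical advantage,
# regardless of which tokens actually mattered for the outcome.
token_advs = broadcast_to_tokens(advs[0], 5)
```

Because every token in a rollout shares one scalar, the update cannot distinguish a decisive step from an incidental one, which is the gap finer-grained credit assignment aims to close.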
📝 Abstract
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
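The spike-anchored grouping described in the abstract can be sketched roughly as follows. This is not the authors' implementation: the abstract does not give the spike criterion, the segment extent, or the weighting function, so the fixed `spike_threshold`, the rule that a segment runs until the next spike, and the weight `1 + alpha * divergence` are all hypothetical choices made only to illustrate the idea.

```python
def gear_style_weights(divergence, spike_threshold=1.0, alpha=0.5):
    """Return one advantage weight per token (illustrative rule only).

    Assumptions, not taken from the paper:
    - a token is a spike when its divergence >= spike_threshold;
    - tokens before the first spike use their own divergence
      (token-level resolution);
    - from a spike onward, every token up to the next spike shares
      the divergence measured at that spike (segment-level credit);
    - the weight is 1 + alpha * signal.
    """
    weights = []
    anchor = None  # divergence at the most recent spike, if any
    for d in divergence:
        if d >= spike_threshold:
            anchor = d  # a spike opens a new adaptive segment
        signal = d if anchor is None else anchor
        weights.append(1.0 + alpha * signal)
    return weights


def reweight_advantage(advantage, weights):
    """Scale a trajectory-level advantage by per-token weights."""
    return [advantage * w for w in weights]


# Hypothetical per-token student-teacher divergence with two spikes.
divergence = [0.1, 0.2, 2.0, 0.1, 0.1, 3.0, 0.2]
weights = gear_style_weights(divergence)

# Reshape a negative trajectory-level advantage of -1.0.
token_advantages = reweight_advantage(-1.0, weights)
```

In this toy input the low-divergence tokens after each spike inherit the spike's weight instead of their own small divergence, mirroring the observation that later tokens in a deviating continuation may return to low divergence even though they belong to the same departure.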
Problem

Research questions and friction points this paper is trying to address.

credit assignment
reinforcement learning
large language model agents
granularity
long-horizon trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive credit assignment
self-distillation
advantage reweighting
granularity-adaptive
LLM agents