Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RL-based LLM reasoning methods—e.g., entropy-based rewards or external semantic comparisons—only encourage superficial sampling diversity and fail to ensure substantive divergence in exploration trajectories along parameter-update directions. To address this, we propose a gradient-guided, sequence-level exploration framework: (1) lightweight sequence features are constructed from last-layer gradient sensitivity; (2) novelty is quantified via intra-group gradient-direction orthogonality; (3) a self-referential, PPO-compatible multiplicative exploration reward is designed; and (4) KL divergence constraints preserve semantic consistency. This work is the first to align exploration mechanisms with the model’s intrinsic gradient geometry. Evaluated on MATH500, AMC, AIME24/25, GPQA, and MMLUpro, our method significantly improves pass@1, maj@16, and pass@k across Qwen3-1.7B and Qwen3-4B, while demonstrably enhancing gradient-space orthogonality.
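Steps (2) and (3) above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `gradient_novelty_rewards`, the cosine-similarity novelty measure, and the `alpha` bound are all hypothetical choices for exposition.

```python
import numpy as np

def gradient_novelty_rewards(features: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Score each trajectory in a sampled group by how orthogonal its
    sequence-level gradient feature is to the rest of the group.

    features: (G, D) array, one feature vector per sampled response
              (e.g. built from last-layer gradient sensitivity).
    Returns one multiplicative reward scalar per trajectory, bounded
    in [1 - alpha, 1 + alpha].
    """
    # Normalize rows so cosine similarity reduces to a dot product.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                      # (G, G) pairwise cosine similarities
    G = sim.shape[0]
    # Mean similarity to the *other* group members (exclude self-similarity).
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / (G - 1)
    # High similarity (redundant direction) -> scalar below 1;
    # orthogonal or opposing direction -> scalar at or above 1.
    return 1.0 - alpha * mean_sim
```

A group of mutually orthogonal features gets a neutral scalar of 1.0 for every member, while a group of identical features is uniformly de-emphasized; the multiplicative form keeps the signal bounded, which matches the PPO-compatibility requirement stated in the summary.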

📝 Abstract
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scalar, while redundant or off-manifold updates are de-emphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
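The abstract describes the exploration scalar as "naturally aligned with PPO-style stability and KL control". One way such a multiplicative reward could plug into a standard PPO clipped objective is sketched below; `shaped_objective`, its argument names, and the `beta`/`eps` defaults are hypothetical illustrations, not the paper's actual loss.

```python
import numpy as np

def shaped_objective(logprob_ratio, advantage, novelty_scalar,
                     kl, beta=0.01, eps=0.2):
    """PPO-style clipped surrogate with a multiplicative exploration
    scalar applied to the advantage, plus a KL penalty to preserve
    semantic consistency (hypothetical sketch, per-trajectory).
    """
    # The bounded scalar rescales the advantage rather than adding a
    # separate bonus, so clipping and KL control still apply unchanged.
    shaped_adv = advantage * novelty_scalar
    unclipped = logprob_ratio * shaped_adv
    clipped = np.clip(logprob_ratio, 1.0 - eps, 1.0 + eps) * shaped_adv
    return np.minimum(unclipped, clipped) - beta * kl
```

Because the scalar multiplies the advantage instead of entering as an additive bonus, a redundant trajectory is merely down-weighted rather than pushed toward a different objective, which is one plausible reading of why the authors call the signal "self-referential" and stability-preserving.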
Problem

Research questions and friction points this paper is trying to address.

Aligns exploration with LLM's gradient geometry for reasoning
Replaces external heuristics with self-referential gradient-guided exploration
Improves reasoning performance by diversifying orthogonal update directions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses model's gradient sensitivity for exploration guidance
Rewards novel gradient directions with a bounded multiplicative scalar
Aligns exploration with policy update geometry naturally