Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Process-supervised reinforcement learning (PSRL) for reasoning models suffers from inefficient exploration, specifically inaccurate branch localization and low sampling and training efficiency. Method: The paper proposes AttnRL, an attention-guided efficient exploration framework that treats attention scores as a measure of step-wise criticality along reasoning paths, precisely identifying the steps worth branching from. AttnRL further introduces an adaptive sampling strategy and a one-step off-policy training pipeline, decoupling exploration from optimization. Results: Evaluated on multiple mathematical reasoning benchmarks, AttnRL consistently outperforms existing PSRL methods, achieving comparable or superior final performance while substantially reducing sampling overhead and training time. The framework thus delivers both effective exploration and computational efficiency.
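The core idea, selecting branch points by step-wise attention, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the aggregation of attention (mean over a step's tokens of the attention mass each token receives) and the function names are assumptions.

```python
import numpy as np

def step_attention_scores(attn, step_spans):
    """Aggregate token-level attention into a per-step criticality score.

    attn: (seq_len,) array of attention mass received by each token,
          e.g. averaged over heads, layers, and later queries (an
          assumption; the paper's exact aggregation may differ).
    step_spans: list of (start, end) token-index pairs, one per reasoning step.
    """
    return np.array([attn[s:e].mean() for s, e in step_spans])

def select_branch_steps(attn, step_spans, k=2):
    """Pick the k steps with the highest attention scores as branch points."""
    scores = step_attention_scores(attn, step_spans)
    return np.argsort(scores)[::-1][:k].tolist()
```

Branching exploration would then resample continuations from just these high-scoring steps instead of from every step, which is where the sampling savings come from.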

📝 Abstract
Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high attention values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance as well as sampling and training efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited exploration efficiency in process-supervised reinforcement learning
Improving branching positions and sampling strategies for reasoning models
Enhancing training efficiency through adaptive sampling and off-policy methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branching from high attention positions
Adaptive sampling based on difficulty
One-step off-policy training pipeline
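The adaptive sampling idea can be sketched as: allocate more rollouts to harder problems, then drop rollout groups whose rewards are all identical, since those contribute zero group-normalized advantage. The linear schedule, bounds, and function names below are illustrative assumptions, not the paper's exact strategy.

```python
import numpy as np

def adaptive_rollouts(pass_rate, n_min=4, n_max=16):
    """Allocate rollouts per problem: lower pass rate (harder problem)
    gets more samples. Linear schedule is an assumption."""
    frac = 1.0 - pass_rate
    return int(round(n_min + frac * (n_max - n_min)))

def filter_nonzero_advantage(groups):
    """Keep only rollout groups with at least two distinct rewards;
    a group with identical rewards has zero advantage after
    group-mean baselining and adds nothing to the gradient."""
    return [g for g in groups if len(set(g["rewards"])) > 1]
```

Filtering before the update keeps every retained group informative, which is what the abstract means by the whole training batch maintaining non-zero advantage values.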