SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an exploration bottleneck in reinforcement learning (RL) for reasoning-oriented large language models: training concentrates probability mass on a narrow subset of high-reward trajectories (the "squeezing effect"), which restricts exploration and limits multi-sample performance (Pass@k). To mitigate this, the authors propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL) and requires no external supervision. SPS treats on-policy rollouts as demonstrations and uses IRL to reshape the induced trajectory distribution, encouraging more diverse reasoning paths. On five commonly used reasoning benchmarks, SPS improves exploration and Pass@k. The authors also analyze RL learning dynamics and identify an empirical upper bound on Pass@k under RL training.

📝 Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
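
To make the Pass@1 versus Pass@k distinction concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The abstract does not state the paper's exact evaluation protocol, so the estimator choice is an assumption, and the per-problem correct counts below are made up purely for illustration. The toy numbers show how concentrating probability mass can raise Pass@1 while lowering Pass@k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 16 samples on four problems.
N_SAMPLES = 16
diverse = [4, 4, 4, 4]      # probability mass spread out: sometimes right everywhere
squeezed = [16, 16, 0, 0]   # mass concentrated: always right or never right

for name, counts in (("diverse", diverse), ("squeezed", squeezed)):
    p1 = mean_pass_at_k(counts, N_SAMPLES, 1)
    p8 = mean_pass_at_k(counts, N_SAMPLES, 8)
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f}")
```

In this toy setting the squeezed policy doubles Pass@1 but roughly halves Pass@8, which is the kind of trade-off the abstract attributes to the squeezing effect.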

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

exploration

reasoning trajectories

probability squeezing

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Steering Probability Squeezing

Reinforcement Learning

Inverse Reinforcement Learning