Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of reinforcement learning (RL) for large language model (LLM) reasoning, where inefficient exploration, low sampling success rates, and training instability hinder performance, particularly on complex tasks under limited rollout budgets. The authors propose LENS, a framework that identifies spurious prompt tokens as a primary cause of exploration failure and introduces an instruction purification mechanism: by detecting and removing such distracting tokens, LENS generates high-quality rollouts that supervise policy optimization under the original noisy prompts. Combining reinforcement learning with verifiable rewards (RLVR) and cross-prompt rollout transfer, LENS improves performance while preserving robustness to realistic noisy inputs. Compared to GRPO, LENS achieves an average performance gain of 3.88% and accelerates convergence by more than 1.6×, offering a new perspective for RLVR research.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
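The purify-then-transfer loop the abstract describes can be sketched as a toy. Everything below is an illustrative assumption, not the paper's implementation: `interference_score`, `rollout_fn`, and `verify_fn` stand in for the paper's unspecified token-detection, policy-sampling, and verifiable-reward components.

```python
def purify(prompt_tokens, interference_score, threshold=0.5):
    """Prune tokens whose interference score exceeds the threshold.

    `interference_score` is a hypothetical per-token scorer standing in
    for the paper's interference-token detection mechanism.
    """
    return [t for t in prompt_tokens if interference_score(t) <= threshold]


def lens_step(noisy_prompt, interference_score, rollout_fn, verify_fn,
              n_rollouts=8):
    """One LENS-style sampling step (illustrative sketch).

    1. Purify the prompt by pruning likely interference tokens.
    2. Sample rollouts from the purified prompt.
    3. Keep verifiably correct rollouts and pair them with the ORIGINAL
       noisy prompt as supervision targets, so the policy learns to
       ignore interference under realistic noisy prompting.
    """
    clean_prompt = purify(noisy_prompt, interference_score)
    rollouts = [rollout_fn(clean_prompt) for _ in range(n_rollouts)]
    successes = [r for r in rollouts if verify_fn(r)]
    # Cross-prompt transfer: supervise on (noisy prompt, clean rollout).
    return [(noisy_prompt, r) for r in successes]


# Toy demonstration: "#noise" is an interfering token that derails rollouts.
noisy = ["solve", "#noise", "2+2"]
score = lambda t: 1.0 if t.startswith("#") else 0.0
rollout = lambda p: "4" if "#noise" not in p else "wrong"
verify = lambda r: r == "4"

pairs = lens_step(noisy, score, rollout, verify)
```

Here every rollout under the purified prompt succeeds, so all eight (noisy prompt, successful rollout) pairs become supervision signal; sampling directly from the noisy prompt would have yielded none.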
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
LLM reasoning
prompt interference
rollout efficiency
sampling success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction Purification
Reinforcement Learning with Verifiable Rewards
Interference Token Pruning
Rollout Efficiency
Policy Optimization