Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of reinforcement learning (RL) for large language model (LLM) reasoning, where inefficient exploration, low sampling success rates, and training instability hinder performance, particularly on complex tasks under limited rollout budgets. The authors propose LENS, a framework that identifies spurious prompt tokens as a primary cause of exploration failure and introduces an instruction purification mechanism: by detecting and removing such distracting tokens, LENS generates high-quality rollouts that supervise policy optimization under the original noisy prompts. Combining reinforcement learning with verifiable rewards (RLVR) and cross-prompt rollout transfer, LENS improves performance while preserving robustness to realistic noisy inputs. Compared to GRPO, LENS achieves an average performance gain of 3.88% and accelerates convergence by more than 1.6×, offering a new perspective for RLVR research.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
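The purify-then-transfer loop the abstract describes can be sketched as a toy. Everything below is an illustrative assumption, not the paper's implementation: `interference_score`, `rollout_fn`, and `verify_fn` stand in for the paper's unspecified token-detection, policy-sampling, and verifiable-reward components.

```python
def purify(prompt_tokens, interference_score, threshold=0.5):
    """Prune tokens whose interference score exceeds the threshold.

    `interference_score` is a hypothetical per-token scorer standing in
    for the paper's interference-token detection mechanism.
    """
    return [t for t in prompt_tokens if interference_score(t) <= threshold]


def lens_step(noisy_prompt, interference_score, rollout_fn, verify_fn,
              n_rollouts=8):
    """One LENS-style sampling step (illustrative sketch).

    1. Purify the prompt by pruning likely interference tokens.
    2. Sample rollouts from the purified prompt.
    3. Keep verifiably correct rollouts and pair them with the ORIGINAL
       noisy prompt as supervision targets, so the policy learns to
       ignore interference under realistic noisy prompting.
    """
    clean_prompt = purify(noisy_prompt, interference_score)
    rollouts = [rollout_fn(clean_prompt) for _ in range(n_rollouts)]
    successes = [r for r in rollouts if verify_fn(r)]
    # Cross-prompt transfer: supervise on (noisy prompt, clean rollout).
    return [(noisy_prompt, r) for r in successes]


# Toy demonstration: "#noise" is an interfering token that derails rollouts.
noisy = ["solve", "#noise", "2+2"]
score = lambda t: 1.0 if t.startswith("#") else 0.0
rollout = lambda p: "4" if "#noise" not in p else "wrong"
verify = lambda r: r == "4"

pairs = lens_step(noisy, score, rollout, verify)
```

Here every rollout under the purified prompt succeeds, so all eight (noisy prompt, successful rollout) pairs become supervision signal; sampling directly from the noisy prompt would have yielded none.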
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
LLM reasoning
prompt interference
rollout efficiency
sampling success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction Purification
Reinforcement Learning with Verifiable Rewards
Interference Token Pruning
Rollout Efficiency
Policy Optimization