Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

📅 2026-05-06

🤖 AI Summary
This work addresses the “zero-advantage problem” in reinforcement learning for large language models: when every sampled rollout for a query fails, the group-relative advantage collapses to zero and the query contributes no training signal. To mitigate this, the authors propose LoPE (Lorem Perturbation for Exploration), a training framework that prepends sequences stochastically assembled from Lorem Ipsum vocabulary (low-perplexity, semantically meaningless pseudo-Latin text) to the prompt before resampling. The perturbation is task-irrelevant, yet it shifts the model's output distribution enough to reach reasoning paths that resampling under the fixed original prompt does not. LoPE builds on Group Relative Policy Optimization (GRPO) and targets the limitation of simply raising the sampling budget, where the static sampling policy constrains exploration. Experiments on 1.7B, 4B, and 7B models show that LoPE significantly outperforms resampling with the original prompts, and further analysis indicates that other low-perplexity, Latin-based random sequences are also effective perturbations.
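The zero-advantage problem can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each rollout's reward is standardized by the mean and standard deviation of its own group; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    Assumption: advantage = (r - group mean) / (group std + eps),
    the usual GRPO-style normalization; eps only avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout out of four succeeds: the group carries a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all advantages are exactly zero, so the policy
# gradient contribution of this query vanishes.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the mixed group the successful rollout receives a positive advantage and the failures a negative one, whereas the all-failed group yields only zeros. That second case is the situation LoPE is designed to resample out of.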
📝 Abstract
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the “zero-advantage problem”: when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
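The perturbation step described in the abstract might look roughly like the sketch below. Only the general idea (prepend a random sequence of Lorem Ipsum words before resampling a query whose rollouts all failed) comes from the abstract. The vocabulary subset, the prefix length of 20 words, the blank-line separator, and the names `lorem_prefix`, `perturb_prompt`, and `prompt_for_resampling` are all assumptions made for illustration, not the authors' implementation.

```python
import random

# Assumption: a small subset of the classic Lorem Ipsum vocabulary.
LOREM_VOCAB = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()


def lorem_prefix(rng, n_words=20):
    """Assemble a random word sequence from the Lorem Ipsum vocabulary."""
    return " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))


def perturb_prompt(prompt, rng, n_words=20):
    """Prepend a task-irrelevant pseudo-Latin prefix to the prompt.

    Assumption: prefix and prompt are separated by a blank line.
    """
    return f"{lorem_prefix(rng, n_words)}\n\n{prompt}"


def prompt_for_resampling(prompt, rewards, rng):
    """Return a perturbed prompt only when every rollout failed.

    Assumption: a reward of 0 marks a failed rollout.
    """
    if all(r == 0 for r in rewards):
        return perturb_prompt(prompt, rng)
    return prompt


rng = random.Random(0)  # seeded only to make the sketch reproducible
question = "What is 17 * 23?"
resample_prompt = prompt_for_resampling(question, [0, 0, 0, 0], rng)
unchanged_prompt = prompt_for_resampling(question, [0, 1, 0, 0], rng)
```

Drawing a fresh prefix for each resampling attempt is what separates this from spending a larger sampling budget on the unchanged prompt: each attempt conditions the model on a different, task-irrelevant context.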
Problem

Research questions and friction points this paper is trying to address.

zero-advantage problem
reinforcement learning
reasoning exploration
Large Language Models
sampling bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt perturbation
reinforcement learning
reasoning exploration
zero-advantage problem
LLM training