The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a vulnerability in large language models (LLMs): benign but instruction-like semantic noise in reference text, such as editorial comments or system traces, can be mistaken for instructions and trigger unintended behavior. The problem often worsens as model scale increases, which the authors call a “curse of helpfulness.” To evaluate this robustness systematically, they introduce the DistractionIF benchmark and show that reinforcement learning with Group Relative Policy Optimization (GRPO) can restore the separation between instructions and non-instructional content. In their experiments, performance drops by up to 30 points as scale increases, while GRPO training improves robustness by up to 15.5% without compromising general instruction-following capability.

📝 Abstract

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.
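
As a rough illustration of the two quantities the abstract refers to, the sketch below computes a perplexity from per-token log-probabilities and the group-relative advantage that gives GRPO its name. It is not the authors' implementation: the function names, the binary reward (1 when a sampled response completes the user task and ignores the distractor, 0 otherwise), the use of the population standard deviation, and the example numbers are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the mean negative log-likelihood.

    token_logprobs: natural-log probabilities the model assigned to each
    token of the response (hypothetical input; must be non-empty).
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each response's advantage is its reward minus the group mean, divided
    by the group standard deviation. Population std is an assumption here;
    implementations differ. eps avoids division by zero when all rewards
    in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled responses to one prompt whose
# reference text contains a distractor instruction.
# Assumed reward: 1 = followed the user task only, 0 = obeyed the distractor.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical log-probs for a robust and a distracted response; a small
# gap between the two perplexities would indicate a weak boundary.
ppl_robust = perplexity([math.log(0.5)] * 3)
ppl_distracted = perplexity([math.log(0.25)] * 3)
```

Under this normalization, responses that ignore the distractor receive positive advantages and those that obey it receive negative ones, so the policy update favors treating the reference text as data. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.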

Problem

Research questions and friction points this paper is trying to address.

instruction-following robustness

distractor instructions

inverse scaling

reference-grounded tasks

semantic noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

inverse scaling law

instruction-following robustness

distractor instructions