🤖 AI Summary
This paper identifies “instructional distraction” in large language models (LLMs): a failure mode in which a model conflates the task instruction with instruction-like content embedded in the input, degrading its instruction-following performance. To study the phenomenon, the authors define it formally and introduce DIM-Bench, a dedicated benchmark that crosses four instruction tasks (rewriting, proofreading, translation, and style transfer) with five input tasks (reasoning, code generation, mathematical reasoning, bias detection, and question answering), yielding real-world instances where the input itself resembles an instruction. Systematic evaluation across state-of-the-art LLMs shows that every model tested is susceptible to instructional distraction, frequently executing the embedded input task instead of following the user's actual instruction, even when prompted explicitly to distinguish the two. The results point to a robustness gap in current LLMs whenever inputs carry inherent instructional structure.
📝 Abstract
Although large language models (LLMs) show exceptional skill at instruction-following tasks, this strength can become a vulnerability when a model must disregard certain instructions. Instruction-following tasks typically pair a clear task description with input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even with explicit prompting to distinguish the task instruction from the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks (rewriting, proofreading, translation, and style transfer) alongside five input tasks (reasoning, code generation, mathematical reasoning, bias detection, and question answering). Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.
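To make the failure mode concrete, here is a minimal sketch of what an instructional-distraction test case looks like. The prompt template, task pairing, and outputs below are illustrative assumptions, not taken from DIM-Bench itself: the instruction asks for translation, but the input text is itself a question, which a distracted model may answer instead of translating.

```python
# Hypothetical instructional-distraction test case (not from DIM-Bench).
# The task instruction is "translate", but the input text looks like
# an instruction of its own. A robust model treats the input as data;
# a distracted model executes the embedded question instead.

def build_prompt(instruction: str, input_text: str) -> str:
    """Compose a prompt that explicitly separates task from input."""
    return (
        f"Instruction: {instruction}\n"
        f"Input: {input_text}\n"
        f"Output:"
    )

instruction = "Translate the following text into French."
input_text = "What is 2 + 2?"

prompt = build_prompt(instruction, input_text)

# Intended behavior: translate the question as literal text,
# e.g. "Qu'est-ce que 2 + 2 ?"
# Distracted behavior: answer the embedded question, e.g. "4".
```

A benchmark built this way can score a model by checking whether its output is a translation of the input (intent followed) or an answer to it (distraction), which is the kind of cross-interference between instruction tasks and input tasks the paper evaluates.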