🤖 AI Summary
This work investigates the robustness of large language models (LLMs) to irrelevant context (IC) in symbolic reasoning. To this end, we introduce GSM-DC, a benchmark with controllable interference, built on the first symbolic reasoning graph framework to support precise, programmable IC injection. We further design a stepwise tree search inference method guided by a process reward model, and develop a training strategy based on strong-interference examples. Experiments reveal that LLMs are highly sensitive to IC: both the fidelity of reasoning path selection and arithmetic accuracy degrade together. Training with strong interference substantially improves out-of-distribution (OOD) generalization, boosting accuracy by up to 23.6%, and our tree search method achieves an 18.4% OOD accuracy gain over baselines. This is the first systematic quantification of IC effects in symbolic reasoning, providing an interpretable modeling framework and effective training and inference paradigms for enhancing LLM reasoning robustness.
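The controllable IC injection described above can be illustrated with a toy sketch: a required reasoning chain is built as a small DAG, and distractor nodes are attached so that they reference entities on the chain but never feed into the final answer. Everything here (node names, the injection rule) is a hypothetical simplification, not the paper's actual GSM-DC construction.

```python
import random

def build_reasoning_graph(chain_len, num_distractors, seed=0):
    """Toy symbolic reasoning DAG: a required chain q0 -> q1 -> ... plus
    distractor nodes that depend on chain nodes but are never needed.
    Illustrative only; the real GSM-DC construction is more elaborate."""
    rng = random.Random(seed)
    # Required chain: each step depends on the previous one.
    edges = {f"q{i}": [f"q{i-1}"] for i in range(1, chain_len)}
    edges["q0"] = []
    # Programmable IC injection: each distractor attaches to a random
    # chain node, but no node on the solution path depends on it.
    for d in range(num_distractors):
        edges[f"d{d}"] = [rng.choice([f"q{i}" for i in range(chain_len)])]
    return edges

def solution_path(edges, answer_node):
    """Backward closure: the nodes actually required for the answer."""
    needed, stack = set(), [answer_node]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(edges[node])
    return needed
```

Because distractors only point *into* the chain, the backward closure from the answer node recovers exactly the clean reasoning path, which is what makes the amount of injected interference precisely controllable.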
📝 Abstract
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating the reasoning robustness of large language models (LLMs) under systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous, reproducible evaluation. Our experiments show that LLMs are highly sensitive to IC, which degrades both reasoning path selection and arithmetic accuracy. Training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios, and a stepwise tree search guided by a process reward model further enhances robustness under out-of-distribution conditions.
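The stepwise tree search mentioned in the abstract can be sketched as a beam search over partial reasoning paths, where a process reward model (PRM) scores each partial path and only the top candidates survive to the next step. In this sketch, `expand` and `prm_score` are hypothetical stand-ins for the paper's step generator and trained PRM.

```python
def stepwise_tree_search(expand, prm_score, root, beam_width=3, depth=4):
    """Stepwise tree search guided by a process reward model (PRM).
    At each depth, expand every partial path by its candidate next
    steps, then keep the beam_width paths the PRM scores highest.
    Sketch only; the actual method trains a PRM on step-level labels."""
    beam = [[root]]
    for _ in range(depth):
        candidates = [path + [step] for path in beam for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=prm_score, reverse=True)  # PRM ranks partial paths
        beam = candidates[:beam_width]
    return beam[0]
```

As a toy usage, expanding each path with the steps `0` and `1` and scoring by agreement with a target step sequence lets the PRM steer the beam onto the correct path even though no single step is globally verified.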