🤖 AI Summary
This work investigates the robustness of large language models (LLMs) to irrelevant context (IC) in symbolic reasoning. To this end, we introduce GSM-DC, a benchmark with controllable interference, built on the first symbolic reasoning graph framework to support precise, programmable IC injection. We further design a stepwise tree search inference method guided by a process reward model, and develop a training strategy based on strong-interference examples. Experiments reveal that LLMs are highly sensitive to IC: both the fidelity of reasoning path selection and arithmetic accuracy degrade together. Training with strong interference substantially improves out-of-distribution (OOD) generalization, boosting accuracy by up to 23.6%, and our tree search method achieves an 18.4% OOD accuracy gain over baselines. This is the first systematic quantification of IC effects in symbolic reasoning, providing an interpretable modeling framework and effective training and inference paradigms for enhancing LLM reasoning robustness.
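The controllable IC injection described above can be illustrated with a toy sketch: a required reasoning chain is built as a small DAG, and distractor nodes are attached so that they reference entities on the chain but never feed into the final answer. Everything here (node names, the injection rule) is a hypothetical simplification, not the paper's actual GSM-DC construction.

```python
import random

def build_reasoning_graph(chain_len, num_distractors, seed=0):
    """Toy symbolic reasoning DAG: a required chain q0 -> q1 -> ... plus
    distractor nodes that depend on chain nodes but are never needed.
    Illustrative only; the real GSM-DC construction is more elaborate."""
    rng = random.Random(seed)
    # Required chain: each step depends on the previous one.
    edges = {f"q{i}": [f"q{i-1}"] for i in range(1, chain_len)}
    edges["q0"] = []
    # Programmable IC injection: each distractor attaches to a random
    # chain node, but no node on the solution path depends on it.
    for d in range(num_distractors):
        edges[f"d{d}"] = [rng.choice([f"q{i}" for i in range(chain_len)])]
    return edges

def solution_path(edges, answer_node):
    """Backward closure: the nodes actually required for the answer."""
    needed, stack = set(), [answer_node]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(edges[node])
    return needed
```

Because distractors only point *into* the chain, the backward closure from the answer node recovers exactly the clean reasoning path, which is what makes the amount of injected interference precisely controllable.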
📝 Abstract
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating the reasoning robustness of large language models (LLMs) under systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous, reproducible evaluation. Our experiments show that LLMs are highly sensitive to IC, which degrades both reasoning path selection and arithmetic accuracy. Training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios, and a stepwise tree search guided by a process reward model further enhances robustness under out-of-distribution conditions.
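The stepwise tree search mentioned in the abstract can be sketched as a beam search over partial reasoning paths, where a process reward model (PRM) scores each partial path and only the top candidates survive to the next step. In this sketch, `expand` and `prm_score` are hypothetical stand-ins for the paper's step generator and trained PRM.

```python
def stepwise_tree_search(expand, prm_score, root, beam_width=3, depth=4):
    """Stepwise tree search guided by a process reward model (PRM).
    At each depth, expand every partial path by its candidate next
    steps, then keep the beam_width paths the PRM scores highest.
    Sketch only; the actual method trains a PRM on step-level labels."""
    beam = [[root]]
    for _ in range(depth):
        candidates = [path + [step] for path in beam for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=prm_score, reverse=True)  # PRM ranks partial paths
        beam = candidates[:beam_width]
    return beam[0]
```

As a toy usage, expanding each path with the steps `0` and `1` and scoring by agreement with a target step sequence lets the PRM steer the beam onto the correct path even though no single step is globally verified.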