Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models struggle to effectively propagate updated information that conflicts with their pre-existing knowledge during multi-step reasoning, often leading to erroneous conclusions. To address this challenge, this work introduces TRACK, the first benchmark specifically designed to evaluate conflict-aware knowledge propagation across three reasoning-intensive domains: WIKI, CODE, and MATH. Through multi-turn, multi-conflict knowledge injection tasks, TRACK systematically assesses a model’s ability to reason correctly after incorporating new factual updates. Experimental results reveal that updating facts can paradoxically degrade reasoning performance and that even when knowledge is successfully integrated, models exhibit systematic flaws in subsequent reasoning steps. By disentangling failures due to inadequate knowledge integration from inherent deficiencies in reasoning mechanisms—and combining quantitative evaluation with error attribution analysis—this study offers a novel perspective for improving model reasoning robustness.
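The conflict-injection and error-attribution setup described above can be sketched in a few lines. Everything here (function names, the prompt wording, the toy attribution rule) is an illustrative assumption, not TRACK's actual schema or scoring code:

```python
# Hypothetical sketch of a TRACK-style item: inject facts that conflict with
# a model's parametric knowledge, then attribute failures in the answer.

def build_prompt(updates, question):
    """Prepend conflicting fact updates in-context before the question."""
    lines = ["Use the following updated facts, even if they conflict "
             "with what you already know:"]
    lines += [f"- {u}" for u in updates]      # multi-conflict: one line per update
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def attribute_error(answer, updated_fact, parametric_fact, correct_conclusion):
    """Toy error attribution, mirroring the paper's distinction: an
    integration failure (the model still recites its parametric fact)
    vs. a reasoning failure (the update is echoed, but the conclusion
    drawn from it is wrong)."""
    if correct_conclusion in answer:
        return "correct"
    if parametric_fact in answer and updated_fact not in answer:
        return "integration_failure"
    return "reasoning_failure"

prompt = build_prompt(
    ["The capital of Australia is Sydney."],   # deliberately conflicting update
    "What is the capital of Australia?",
)
```

String matching is of course far too crude for real evaluation; it stands in for whatever answer-checking the benchmark actually uses, and only serves to make the integration-vs-reasoning distinction concrete.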

📝 Abstract
A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when an update fails to overwrite the model's parametric knowledge, and these conflicts propagate into faulty reasoning. Existing benchmarks for this problem largely focus on single knowledge updates and fact recall without evaluating how updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model's initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts for reasoning can worsen performance compared to providing none, and that this degradation grows as more updated facts are provided. We show this failure stems not only from an inability to faithfully integrate updated facts but also from flawed reasoning even when the knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.
Problem

Research questions and friction points this paper is trying to address.

knowledge propagation
multi-step reasoning
conflicting knowledge
large language models
reasoning failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge conflict
multi-step reasoning
large language models
knowledge propagation
benchmark