🤖 AI Summary
This work addresses the entanglement of parametric knowledge (memorized world facts) and genuine reasoning capability in large language models (LLMs).
Method: We propose SynthWorlds, a framework that constructs two structurally identical but semantically disjoint parallel corpora: a real-mapped world, where parametric knowledge applies, and a synthetic-mapped world, where it is meaningless. On top of these corpora, mirrored multi-hop question answering and page navigation tasks hold reasoning complexity constant across the two worlds, so performance differences isolate memorized knowledge from reasoning. Models are evaluated in both parametric-only (closed-book QA) and knowledge-augmented (retrieval-augmented) settings, and world construction is fully automatic and scalable.
Contribution/Results: Experiments reveal a persistent, measurable "knowledge advantage gap" across models, defined as the performance boost gained from memorized parametric knowledge. Knowledge acquisition and integration mechanisms (e.g., retrieval augmentation) reduce but do not eliminate this gap. SynthWorlds thus provides a controlled, reproducible environment for precise, testable comparisons of reasoning and memorization, pointing to opportunities for targeted system improvements.
📝 Abstract
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
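The knowledge advantage gap described above can be sketched as a simple difference in task accuracy between the two mirrored worlds. The function name and the illustrative numbers below are hypothetical, not taken from the paper; this is a minimal sketch assuming accuracy is the shared metric for both worlds.

```python
def knowledge_advantage_gap(acc_real: float, acc_synth: float) -> float:
    """Accuracy on the real-mapped world minus accuracy on the structurally
    identical synthetic-mapped world. Because the mirrored tasks hold
    reasoning difficulty constant, a positive gap indicates the model is
    benefiting from memorized parametric knowledge rather than reasoning."""
    return acc_real - acc_synth

# Hypothetical numbers for illustration only (not results from the paper):
gap_closed_book = knowledge_advantage_gap(0.62, 0.31)  # parametric-only setting
gap_rag = knowledge_advantage_gap(0.71, 0.55)          # retrieval-augmented setting
print(gap_closed_book, gap_rag)
```

Under this framing, a retrieval-augmented gap that shrinks but stays positive (as in the illustrative values above) would match the paper's finding that knowledge integration mechanisms reduce but do not eliminate the gap.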