The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C

📅 2024-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a “two-hop reasoning failure” in large language models (LLMs) when chain-of-thought (CoT) prompting is absent: LLMs cannot implicitly compose two disjoint facts acquired from separate documents (e.g., A→B and B→C) to infer A→C, degrading accuracy to chance level. Method: The authors introduce the first controllable two-hop reasoning benchmark and conduct systematic evaluation via supervised fine-tuning (e.g., Llama-3-8B-Instruct, GPT-4o), counterfactual fact injection, prompt-based fact isolation, and real-world two-hop question answering. Contribution/Results: Experiments reveal that nine state-of-the-art LLMs achieve chance-level accuracy on over half of real-world question categories without CoT, demonstrating severe reliance on factual co-occurrence rather than logical composition and thereby challenging assumptions of latent general-purpose reasoning capability. Enabling CoT restores much of the lost performance, confirming explicit step-by-step reasoning as a critical mechanism for multi-hop inference.

📝 Abstract
[Notice: This version is outdated. Recent research contradicts some key claims; we are working on a major revision with more nuanced analysis. Please wait for the updated version.] While LLMs excel at multi-hop questions (e.g., "Who is the spouse of the performer of Imagine?") when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where above-chance performance constitutes undeniable evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about them using CoT. We find that models can perform latent reasoning when facts appear together during training or in the prompt. However, to our surprise, models completely fail at two-hop reasoning without CoT when learned facts only appear in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts, finding that models completely fail at two-hop no-CoT reasoning for over half of question categories while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.
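The controlled setting the abstract describes pairs fictional one-hop facts (A→B in one document, B→C in another) with a two-hop question (A→C) whose answer is only reachable by composing the two facts. A minimal sketch of such a data generator is shown below; all entity names, fields, and templates here are invented for illustration and are not the paper's actual benchmark.

```python
import random

# Invented fictional entities, so no real-world knowledge can leak into answers.
PEOPLE = ["Zorvax", "Quillem", "Brintal", "Faylor"]   # A entities
CITIES = ["Vastoria", "Merrowdale", "Kintholm", "Ossfield"]  # B (bridge) entities
DISHES = ["glimmer stew", "moss bread", "ember pie", "frostcake"]  # C entities

def make_two_hop_example(rng: random.Random) -> dict:
    """Build one example: two one-hop facts in separate 'documents'
    plus a two-hop question that requires composing them."""
    person = rng.choice(PEOPLE)  # A
    city = rng.choice(CITIES)    # B
    dish = rng.choice(DISHES)    # C
    return {
        # Each fact would appear in its own fine-tuning document,
        # so the model never sees A and C co-occur in training.
        "doc_a_to_b": f"{person} lives in {city}.",
        "doc_b_to_c": f"The local dish of {city} is {dish}.",
        # No-CoT evaluation: the model must compose the facts latently.
        "question": f"What is the local dish of the city where {person} lives?",
        "answer": dish,
    }

example = make_two_hop_example(random.Random(0))
```

Because the facts are fictional and the bridge entity never appears in the question, above-chance accuracy on `question` without CoT would be evidence of latent composition rather than memorized co-occurrence.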
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Two-hop Reasoning
Chain-of-Thought Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-Hop Curse
Latent Multi-Hop Reasoning
Chain-of-Thought Reasoning