Do Large Language Models Latently Perform Multi-Hop Reasoning?

📅 2024-02-26

🏛️ Annual Meeting of the Association for Computational Linguistics

📈 Citations: 97

✨ Influential: 9

🤖 AI Summary

This work investigates whether large language models (LLMs) implicitly perform multi-hop reasoning within their internal representations, e.g., completing “The mother of the singer of ‘Superstition’ is” by first identifying Stevie Wonder and then retrieving what the model knows about his mother. Method: The authors analyze the two hops separately. For the first hop, they test whether a prompt that indirectly mentions the bridge entity, rather than any other entity, increases the model’s internal recall of that entity. For the second hop, they test whether increasing this recall causes the model to make better use of what it knows about the bridge entity. Results are compared across relation types and model sizes. Contribution/Results: For certain relation types there is strong evidence of latent multi-hop reasoning, with the pathway used in more than 80% of prompts. Utilization is highly contextual, however, and varies across prompt types. On average, the evidence is substantial only for the first hop and moderate for the second hop and the full multi-hop traversal. The first hop shows a clear scaling trend with model size; the second hop does not.

📝 Abstract

We study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway where an LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder, the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze these two hops individually and consider their co-occurrence as indicative of latent multi-hop reasoning. For the first hop, we test if changing the prompt to indirectly mention the bridge entity instead of any other entity increases the LLM's internal recall of the bridge entity. For the second hop, we test if increasing this recall causes the LLM to better utilize what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual, varying across different types of prompts. Also, on average, the evidence for the second hop and the full multi-hop traversal is rather moderate and only substantial for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second hop. Our experimental findings suggest potential challenges and opportunities for future development and applications of LLMs.
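
To make the first-hop test concrete, here is a minimal sketch of how such a comparison could be tallied. It is not the authors' implementation. It assumes that "internal recall" can be approximated by the log-probability of a bridge-entity token under a softmax over some model scores, and all names (`log_softmax_at`, `first_hop_rate`, `toy_pairs`) and all numbers are hypothetical toy values, not results from the paper. The second-hop test, which intervenes on the recall itself, is not covered here.

```python
import math
from typing import Sequence, Tuple


def log_softmax_at(logits: Sequence[float], index: int) -> float:
    """Log-probability of `index` under softmax(logits), computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_norm


def first_hop_rate(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
    bridge_token: int,
) -> float:
    """Fraction of prompt pairs in which bridge-entity recall rises.

    Each pair holds toy scores for (a prompt that indirectly mentions the
    bridge entity, a control prompt that mentions some other entity).
    """
    wins = sum(
        log_softmax_at(with_bridge, bridge_token)
        > log_softmax_at(control, bridge_token)
        for with_bridge, control in pairs
    )
    return wins / len(pairs)


# Hypothetical scores over a 3-token vocabulary; token 0 stands in for
# the bridge entity (assumption: one token represents the entity).
toy_pairs = [
    ([2.0, 0.5, 0.1], [0.3, 1.5, 0.2]),  # recall rises
    ([1.0, 1.0, 1.0], [1.2, 0.1, 0.1]),  # recall falls
    ([3.0, 0.0, 0.0], [0.0, 0.0, 0.0]),  # recall rises
    ([0.5, 0.2, 0.1], [0.1, 0.6, 0.3]),  # recall rises
]
rate = first_hop_rate(toy_pairs, bridge_token=0)
print(f"first-hop success rate: {rate:.2f}")  # 0.75 on this toy data
```

A rate above 0.5 means recall rose in more pairs than it fell. With a real model, the toy scores would be replaced by values derived from its hidden states.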

Problem

Research questions and friction points this paper is trying to address.

Study whether LLMs perform latent multi-hop reasoning on complex prompts

Analyze evidence of two-step reasoning via bridge entity identification

Examine contextual variability and scaling trends in reasoning performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes latent multi-hop reasoning in LLMs

Tests recall and utilization of bridge entities

Evaluates reasoning scalability with model size
