Multi-Hop Question Answering: When Can Humans Help, and Where Do They Struggle?

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-hop question answering (QA) demands identifying multi-hop reasoning requirements, performing reading comprehension and logical inference, and integrating knowledge across documents—posing significant challenges for both humans and large language models. This study conducts a crowdsourced experiment to quantitatively assess human performance across constituent subtasks: while knowledge integration achieves high accuracy (97%), multi-hop requirement identification drops to 67%, and semantic mismatches (e.g., answering “where” with “when”) persist in both single- and multi-hop QA. Results reveal that humans excel at cross-document knowledge fusion but struggle with reasoning path planning and semantic alignment. Based on these findings, we propose a “complementary human-AI collaboration” design paradigm: AI handles reasoning trigger identification and semantic constraint enforcement, while humans lead knowledge integration. This work provides empirical grounding and architectural guidance for building interpretable, robust hybrid intelligence systems.

📝 Abstract
Multi-hop question answering is a challenging task for both large language models (LLMs) and humans, as it requires recognizing when multi-hop reasoning is needed, followed by reading comprehension, logical reasoning, and knowledge integration. To better understand how humans might collaborate effectively with AI, we evaluate the performance of crowd workers on these individual reasoning subtasks. We find that while humans excel at knowledge integration (97% accuracy), they often fail to recognize when a question requires multi-hop reasoning (67% accuracy). Participants perform reasonably well on both single-hop and multi-hop QA (84% and 80% accuracy, respectively), but frequently make semantic mistakes, for example answering "when" an event happened when the question asked "where." These findings highlight the importance of designing AI systems that complement human strengths while compensating for common weaknesses.
Problem

Research questions and friction points this paper is trying to address.

Quantifying crowd-worker performance on the individual subtasks of multi-hop reasoning in QA
Contrasting human strength in knowledge integration with weakness in recognizing multi-hop reasoning requirements
Designing AI systems that complement human strengths and compensate for common weaknesses
Innovation

Methods, ideas, or system contributions that make the work stand out.

A crowdsourced evaluation that decomposes multi-hop QA into measurable reasoning subtasks
Quantitative evidence that humans excel at knowledge integration (97% accuracy) but struggle to recognize multi-hop requirements (67% accuracy)
A complementary human-AI collaboration paradigm: AI identifies reasoning triggers and enforces semantic constraints, while humans lead knowledge integration