🤖 AI Summary
This work addresses a key limitation of current large language models (LLMs) in medical question answering: they often rely on single-hop factual recall and fail at the multi-hop diagnostic reasoning required in real-world clinical settings. A central challenge is the models' tendency to exploit shortcut pathways through generic hub nodes in medical knowledge graphs, such as "inflammation". To mitigate this, the authors introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions, along with a novel k-Shattering algorithm that prunes knowledge graphs to eliminate logical shortcuts. The framework further incorporates implicit bridge-entity masking and topology-driven hard negative sampling, establishing the first shortcut-resistant evaluation environment for multi-hop medical reasoning. Experiments across 21 LLMs reveal generally poor multi-hop reasoning, while retrieval-augmented generation (RAG) substantially restores performance once the masked evidence is supplied, underscoring the benchmark's validity and necessity.
📝 Abstract
While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/
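The abstract describes $k$-Shattering as physically pruning generic hub nodes (e.g., "inflammation") so that shortcut paths through them are severed. The paper's actual procedure is not given here, so the following is only a minimal sketch under one plausible reading: iteratively remove any node whose degree exceeds a threshold $k$, until no such hubs remain. The function name, the threshold semantics, and the toy graph are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def k_shatter(edges, k):
    """Hypothetical hub-pruning sketch: iteratively delete nodes with
    degree > k (generic hubs) and all their incident edges, until the
    graph is free of such hubs. `edges` is an iterable of undirected
    (u, v) pairs; returns the surviving edge set."""
    edges = {tuple(sorted(e)) for e in edges}  # normalize undirected edges
    while True:
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        hubs = {n for n, d in degree.items() if d > k}
        if not hubs:
            return edges
        # "Shatter" the hubs: drop every edge touching one.
        edges = {(u, v) for u, v in edges if u not in hubs and v not in hubs}

# Toy graph: "inflammation" is a generic hub linking unrelated diseases,
# creating 2-hop shortcuts like gastritis -> inflammation -> arthritis.
toy = [
    ("gastritis", "inflammation"),
    ("arthritis", "inflammation"),
    ("colitis", "inflammation"),
    ("hepatitis", "inflammation"),
    ("gastritis", "H. pylori"),  # a specific, non-hub relation survives
]

pruned = k_shatter(toy, k=3)
# "inflammation" has degree 4 > 3, so its edges are removed;
# only the specific gastritis–H. pylori edge remains.
```

After pruning, a model can no longer hop through "inflammation" to connect gastritis and arthritis; only the specific pathological relation survives, which is the shortcut-severing effect the abstract attributes to $k$-Shattering.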