MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

πŸ“… 2025-07-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing RAG knowledge-conflict benchmarks are limited to single-hop QA, entity-replacement-based construction, and narrow conflict categories, hindering systematic study of multi-hop reasoning scenarios. To address this, the authors propose MAGIC, a knowledge graph (KG)-based multi-hop knowledge-conflict benchmark. It leverages the structural semantics of KGs to generate multi-document contexts that are semantically similar yet factually conflicting, covering diverse and interpretable conflict types; all instances undergo manual validation for logical coherence and linguistic naturalness. MAGIC addresses three key limitations of prior benchmarks: task scope (extending beyond single-hop QA), conflict diversity (encompassing cross-context, multi-hop conflicts), and interpretability (providing explicit conflict rationales). Experiments reveal that state-of-the-art LLMs perform poorly on both conflict detection and localization, exposing fundamental weaknesses in resolving contradictory information during multi-hop reasoning.

πŸ“ Abstract
Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection -- especially when multi-hop reasoning is required -- and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
Problem

Research questions and friction points this paper is trying to address.

Addresses knowledge conflicts among retrieved documents in retrieval-augmented generation systems
Overcomes limitations of existing benchmarks: single-hop focus, entity substitution, narrow conflict types
Enables systematic study of how well LLMs detect and localize multi-hop conflicts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph-based framework for conflict generation
Multi-hop reasoning for detecting subtle conflicts
Explicit relational structure ensures interpretability
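To make the core idea concrete, here is a minimal, hypothetical sketch of how a KG path can yield two similar but conflicting multi-hop contexts. The entity and relation names (`Film X`, `Director A`, etc.) and the `verbalize` helper are illustrative assumptions, not drawn from the MAGIC dataset or the authors' actual pipeline:

```python
# Hypothetical sketch: deriving conflicting contexts from a KG path.
# Names and the verbalization scheme are illustrative, not from MAGIC itself.

def verbalize(path):
    """Turn a list of (head, relation, tail) triples into simple sentences."""
    return " ".join(f"{h} {r} {t}." for h, r, t in path)

# A two-hop path answering "Where was the director of Film X born?"
path_a = [
    ("Film X", "was directed by", "Director A"),
    ("Director A", "was born in", "City P"),
]

# A near-identical path: only the tail entity of the second hop differs,
# so the two contexts conflict on the final answer.
path_b = [
    ("Film X", "was directed by", "Director A"),
    ("Director A", "was born in", "City Q"),
]

context_a = verbalize(path_a)
context_b = verbalize(path_b)

# Because the contexts are grounded in triples, the conflict is
# localizable to the exact triple pair that disagrees.
conflicts = [(a, b) for a, b in zip(path_a, path_b) if a != b]
```

Detecting this conflict requires chaining both hops: each sentence is individually plausible, and the contradiction only surfaces after resolving the intermediate entity, which is the multi-hop difficulty the benchmark targets.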
Jungyeon Lee
Hanyang University, Seoul, Republic of Korea
Kangmin Lee
Hanyang University, Seoul, Republic of Korea
Taeuk Kim
Assistant Professor, Hanyang University.
Natural Language Processing Β· Large Language Models Β· Machine Learning