🤖 AI Summary
Existing multi-answer question answering (MAQA) research predominantly assumes answer consistency and fails to model the semantic conflicts among answers that are prevalent in real-world scenarios; moreover, high-quality, fine-grained, conflict-annotated benchmarks are lacking. Method: We propose a conflict-aware MAQA paradigm that requires models to jointly identify all valid answers and precisely detect conflicting answer pairs. To support this, we introduce NATCONFQA, the first realistic, human-verified, structurally annotated conflict-aware QA dataset, covering diverse semantic conflict types. Its construction employs a low-cost pipeline that integrates fact-checking resources with rigorous human validation. Contribution/Results: Evaluating eight state-of-the-art large language models on NATCONFQA reveals substantial deficiencies in conflict detection and logical-consistency reasoning. Our work establishes a novel evaluation dimension for assessing model logical robustness and provides empirical grounding for future research on conflict-aware reasoning.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong performance in question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across evidence, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers, but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA, a new benchmark for realistic, conflict-aware MAQA, enriched with detailed conflict labels for all answer pairs. We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts and the flawed strategies they employ to resolve them.
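To make the task setting concrete, the following is a minimal sketch of what a conflict-aware MAQA instance and its expected output might look like. The field names and the `conflict_type` values are illustrative assumptions for this example, not NATCONFQA's actual schema:

```python
# Hypothetical sketch of a conflict-aware MAQA instance; field names are
# illustrative assumptions, not the dataset's published schema.
instance = {
    "question": "When did the bridge open to traffic?",
    "answers": ["1931", "1932"],           # all valid answers supported by evidence
    "conflict_pairs": [("1931", "1932")],  # answer pairs whose sources disagree
    "conflict_type": "temporal",           # assumed label; e.g. temporal, numeric
}

def expected_output(inst):
    """Under this paradigm a model must both enumerate every valid answer
    and flag exactly which answer pairs conflict (possibly none)."""
    return {
        "answers": inst["answers"],
        "conflicts": inst["conflict_pairs"],
    }

print(expected_output(instance))
```

The key difference from standard MAQA is that returning the answer set alone is insufficient; the pairwise conflict labels are part of the prediction target.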