FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dialogue hallucination detection methods assign a single factual label to entire responses, failing to capture the fine-grained mixture of accurate, erroneous, and unverifiable claims within a single utterance. To address this limitation, we propose the novel task of fine-grained dialogue fact verification and introduce FineDialFact—the first dedicated benchmark for this task. FineDialFact decomposes dialogue responses into atomic factual units and provides per-unit truth annotations. Constructed from public open-domain dialogue datasets (e.g., HybriDialogue), it supports evaluation of reasoning paradigms such as chain-of-thought (CoT). Empirical results show that even state-of-the-art models achieve only 0.75 F1, underscoring the task’s difficulty and significance. We publicly release the dataset and code to establish foundational resources for fine-grained factual verification in dialogue systems.

📝 Abstract
Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which pose significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, hallucination detection has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily verify the factual consistency of a generated response as a whole. However, responses often contain a mix of accurate, inaccurate, and unverifiable facts, making a single factual label overly simplistic and coarse-grained. In this paper, we introduce FineDialFact, a benchmark for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset from publicly available dialogue datasets and evaluate various baseline methods on it. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can improve dialogue fact verification. Even so, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains challenging for future research. Our dataset and code will be made public on GitHub.
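To make the task concrete, here is a minimal sketch (not the paper's code) of what fine-grained scoring looks like: each response is decomposed into atomic facts, each fact receives its own verification label, and an F1-score is computed over those per-fact labels. The label names and the macro-averaging choice are assumptions for illustration; the paper's exact label set and metric may differ.

```python
# Illustrative sketch only: per-atomic-fact verification labels instead of
# one coarse label per response. Label names are assumed for illustration.
LABELS = ("supported", "refuted", "unverifiable")

def f1_per_label(gold, pred, label):
    """Binary F1 for one label over parallel lists of atomic-fact labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    """Macro-average F1 across the verification labels."""
    return sum(f1_per_label(gold, pred, l) for l in LABELS) / len(LABELS)

# One response decomposed into four atomic facts, each labelled independently.
gold = ["supported", "refuted", "unverifiable", "supported"]
pred = ["supported", "refuted", "supported", "supported"]
print(round(macro_f1(gold, pred), 3))  # → 0.6
```

Under a single-label scheme this response would collapse to one verdict; the per-fact view shows exactly which claim the verifier got wrong (the unverifiable one was mislabelled as supported).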
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in dialogue systems' responses
Verifying fine-grained atomic facts in dialogues
Improving fact verification using Chain-of-Thought reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained dialogue fact verification benchmark
Chain-of-Thought reasoning enhances verification
Dataset based on public dialogue sources