🤖 AI Summary
Existing dialogue hallucination detection methods assign a single factual label to entire responses, failing to capture the fine-grained mixture of accurate, erroneous, and unverifiable claims within a single utterance. To address this limitation, we propose the novel task of fine-grained dialogue fact verification and introduce FineDialFact—the first dedicated benchmark for this task. FineDialFact decomposes dialogue responses into atomic factual units and provides per-unit truth annotations. Constructed from public open-domain dialogue datasets (e.g., HybriDialogue), it supports evaluation of reasoning paradigms such as chain-of-thought (CoT). Empirical results show that even state-of-the-art models achieve only 0.75 F1, underscoring the task’s difficulty and significance. We publicly release the dataset and code to establish foundational resources for fine-grained factual verification in dialogue systems.
📝 Abstract
Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be public on GitHub.