🤖 AI Summary
Existing hallucination detection benchmarks are largely confined to single-turn English settings and leave the linguistic and regulatory complexities of Korean financial multi-turn dialogues unaddressed. This work introduces the first hallucination detection benchmark for Korean financial multi-turn retrieval-augmented generation (RAG). It builds realistic dialogues grounded in authentic financial documents, injects fine-grained hallucinations according to a hierarchical, answerability-based taxonomy, and explicitly models justified refusal. The benchmark is then used to evaluate frontier and open-source large language models as hallucination detectors. Results show that even the strongest models fall short on fine-grained financial diagnostics and justified refusals, whereas an 8B-parameter model fine-tuned on the benchmark's training split performs comparably to frontier models.
📝 Abstract
Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.
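A per-axis comparison of the kind described above (where justified abstention is reported as the weakest axis) could be computed roughly as follows. This is a minimal sketch, not the benchmark's actual evaluation code: the axis names, label strings, and the `per_axis_accuracy` helper are hypothetical, and the toy records are illustrative placeholders rather than K-FinHallu data or results.

```python
from collections import defaultdict


def per_axis_accuracy(records):
    """Return detector accuracy for each taxonomy axis.

    `records` is an iterable of (axis, gold_label, predicted_label) tuples.
    The axis and label names are hypothetical; the benchmark's real
    taxonomy is not specified here.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for axis, gold, pred in records:
        total[axis] += 1
        correct[axis] += int(gold == pred)
    return {axis: correct[axis] / total[axis] for axis in total}


# Toy records for illustration only (not real benchmark data).
toy_records = [
    ("answerable", "hallucinated", "hallucinated"),
    ("answerable", "faithful", "faithful"),
    ("justified_abstention", "faithful", "hallucinated"),
    ("justified_abstention", "faithful", "faithful"),
]

scores = per_axis_accuracy(toy_records)
weakest_axis = min(scores, key=scores.get)
print(scores)
print("weakest axis:", weakest_axis)
```

Reporting accuracy per axis rather than a single aggregate makes it possible to see whether a detector that looks strong overall still fails on a specific behavior such as refusals.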