K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing hallucination detection benchmarks are largely confined to single-turn English settings and do not capture the linguistic and regulatory complexities of multi-turn Korean financial dialogues. This work introduces the first hallucination detection benchmark for multi-turn retrieval-augmented generation (RAG) in Korean finance. It builds realistic dialogues grounded in authentic financial documents, injects fine-grained hallucinations according to a hierarchical answerability schema, and explicitly models appropriate refusal behavior. The resulting domain-specific hallucination taxonomy and benchmark dataset are used to evaluate prominent large language models as hallucination detectors. Experiments show that state-of-the-art models fall short on fine-grained financial reasoning and justified refusals, whereas an 8B-parameter model fine-tuned on the benchmark performs comparably to leading models.
📝 Abstract
Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.
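The evaluation setup described in the abstract, where an LLM acts as a hallucination detector and is scored per category, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the three labels in `LABELS`, the fields of `Example`, and the `detector` callable are hypothetical and are not taken from the paper, which does not list its taxonomy categories or data schema in this summary. The sketch only shows how per-label F1 can expose a weak axis such as justified abstention.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical label set. The paper describes a hierarchical taxonomy based on
# context answerability, but its actual categories are not given here.
LABELS = ("faithful", "hallucinated", "justified_abstention")


@dataclass
class Example:
    """One detection instance (assumed schema, not the benchmark's real format)."""
    dialogue: List[str]  # earlier turns plus the current user question
    context: str         # retrieved financial passage
    response: str        # assistant response to be judged
    gold: str            # gold label, one of LABELS


def per_label_f1(examples: List[Example],
                 detector: Callable[[Example], str]) -> Dict[str, float]:
    """Compute one-vs-rest F1 for each label from the detector's predictions."""
    tp: Dict[str, int] = defaultdict(int)
    fp: Dict[str, int] = defaultdict(int)
    fn: Dict[str, int] = defaultdict(int)
    for ex in examples:
        pred = detector(ex)
        if pred == ex.gold:
            tp[pred] += 1
        else:
            fp[pred] += 1      # predicted label was wrong
            fn[ex.gold] += 1   # gold label was missed
    scores = {}
    for label in LABELS:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores)
```

Reporting scores per label rather than a single accuracy figure is what makes a finding like "abstention is the weakest axis" visible; a detector can score well overall while failing on one category.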
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
multi-turn RAG
Korean finance
LLM benchmarking
justified abstention
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination detection
multi-turn RAG
Korean finance
justified abstention
benchmark