🤖 AI Summary
This work addresses the lack of standardized evaluation for hallucination detection in large language models (LLMs) in low-resource languages by introducing the ViHallu Challenge, the first shared task dedicated to Vietnamese hallucination detection. The authors release the ViHallu dataset, comprising 10,000 annotated triplets categorized as non-hallucinated, intrinsic hallucination, or extrinsic hallucination. To establish a fine-grained and robust benchmark, they incorporate factual, noisy, and adversarial prompts. The best-performing system, which combines instruction tuning, structured prompt engineering, and ensemble learning, achieves a macro F1-score of 84.80%, substantially outperforming the 32.83% baseline. While these results validate the proposed approaches, they also highlight the persistent difficulty of detecting intrinsic hallucinations.
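The annotated triplets described above can be pictured as a small record type. This is a hypothetical sketch only: the field and label names are illustrative and are not taken from the released dataset files.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative record layout for one ViHallu sample (field names are
# assumptions, not the dataset's actual schema).
@dataclass
class ViHalluSample:
    context: str    # source passage the response should be grounded in
    prompt: str     # a factual, noisy, or adversarial question
    response: str   # the LLM output to be judged
    label: Literal["no", "intrinsic", "extrinsic"]

# Example: a correctly grounded (non-hallucinated) response.
sample = ViHalluSample(
    context="Hà Nội là thủ đô của Việt Nam.",
    prompt="Thủ đô của Việt Nam là gì?",
    response="Thủ đô của Việt Nam là Hà Nội.",
    label="no",
)
```

Under this framing, an *intrinsic* label would mark a response that contradicts the context, while an *extrinsic* label would mark one that fabricates information absent from it.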
📝 Abstract
The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated (context, prompt, response) triplets systematically partitioned into three categories: no hallucination, intrinsic hallucination, and extrinsic hallucination. The dataset incorporates three prompt types -- factual, noisy, and adversarial -- to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
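The macro-F1 metric used to rank systems is the unweighted mean of per-class F1 scores, so the minority hallucination classes weigh as much as the majority class. A minimal sketch of this computation over the three ViHallu labels (label strings here are illustrative):

```python
def macro_f1(y_true, y_pred, labels=("no", "intrinsic", "extrinsic")):
    """Macro-F1: the unweighted mean of per-class F1 scores.

    Each class contributes equally, regardless of how many samples
    it has, which penalizes systems that ignore rare classes.
    """
    f1_scores = []
    for label in labels:
        # Count true positives, false positives, false negatives per class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

The same value can be obtained with `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`; the hand-rolled version is shown only to make the averaging explicit.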