DSC2025 - ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of standardized evaluation for hallucination detection in large language models (LLMs) for low-resource languages by introducing the ViHallu Challenge, the first shared task dedicated to Vietnamese hallucination detection. The authors release the ViHallu dataset, comprising 10,000 annotated triplets categorized as non-hallucinated, intrinsic hallucination, or extrinsic hallucination. To establish a fine-grained and robust benchmark, they incorporate factual, noisy, and adversarial prompts. The best-performing system, which combines instruction tuning, structured prompt engineering, and ensemble learning, achieves a macro F1-score of 84.80%, substantially outperforming the baseline of 32.83%. While the results validate the effectiveness of the proposed approach, they also highlight the persistent challenge of detecting intrinsic hallucinations.
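The three-way labeling scheme described above can be sketched as a structured judge prompt over one (context, prompt, response) triplet. This is an illustrative reconstruction only: the function name, label strings, and prompt wording are assumptions, not taken from the challenge's actual codebase.

```python
# Three-way labels from the summary above: no hallucination,
# intrinsic (contradicts the context), extrinsic (fabricates beyond it).
LABELS = ("no", "intrinsic", "extrinsic")

def build_detection_prompt(context: str, prompt: str, response: str) -> str:
    """Format one triplet as a structured instruction for an LLM judge.

    Hypothetical template in the spirit of the paper's structured
    prompting; the exact wording used by participants is not specified here.
    """
    return (
        "Classify the response against the context.\n"
        f"Context: {context}\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Answer with exactly one of: no, intrinsic, extrinsic."
    )
```

An instruction-tuned model would be queried with this string and its single-word answer mapped back onto `LABELS`; ensembling then amounts to majority-voting such answers across several models or prompt variants.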

📝 Abstract
The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types -- factual, noisy, and adversarial -- to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
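The ranking metric quoted above, macro-F1, is the unweighted mean of per-class F1 over the three categories, so the rarer hallucination classes count as much as the "no hallucination" class. A minimal self-contained sketch (label strings are illustrative, not the challenge's official identifiers):

```python
def macro_f1(y_true, y_pred, labels=("no", "intrinsic", "extrinsic")):
    """Unweighted mean of per-class F1 scores over the three categories."""
    f1_scores = []
    for c in labels:
        # Per-class counts treating class c as "positive".
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Under this metric, a system that always predicts the majority class scores near 1/3 of its majority-class F1 at best, which is consistent with the weak encoder-only baseline (32.83%) reported against the winning 84.80%.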
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
Vietnamese LLMs
low-resource languages
intrinsic hallucination
extrinsic hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination detection
Vietnamese LLMs
ViHallu dataset
structured prompting
shared task
Anh Thi-Hoang Nguyen
University of Information Technology, Vietnam National University Ho Chi Minh City (VNU-HCM)
AI, Data Science, Machine Learning, Deep Learning, Natural Language Processing
Khanh Quoc Tran
Faculty of Information Science and Engineering, University of Information Technology
T. Huynh
Faculty of Information Science and Engineering, University of Information Technology
Phuoc Tan-Hoang Nguyen
Faculty of Information Science and Engineering, University of Information Technology
Cam Tan Nguyen
Faculty of Information Science and Engineering, University of Information Technology
Kiet Van Nguyen
University of Information Technology, VNU-HCM
Data Science, Artificial Intelligence, Computational Linguistics