AI Summary
This study addresses the lack of reliable evaluation benchmarks for large language models (LLMs) in low-resource, culturally specific domains such as Vietnamese Traditional Medicine (VTM), where model performance is significantly constrained. To bridge this gap, the authors propose a synthetic methodology integrating Retrieval-Augmented Generation (RAG) with dual-model consistency verification, augmented by substring-level evidence checking and expert review. This approach yields the first VTM evaluation benchmark, comprising 3,190 questions spanning multiple difficulty levels. The dataset achieves a human validation rate of 94.2% (Fleiss' κ = 0.82). Empirical evaluation reveals that general-purpose LLMs with prior Chinese-language knowledge outperform models specifically trained on Vietnamese, highlighting the potential for cross-lingual conceptual transfer in specialized medical contexts.
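The Fleiss' κ figure reported above can be reproduced with the standard formula: per-item agreement is averaged into an observed agreement P̄, compared against the chance agreement Pₑ implied by the overall category proportions. The sketch below is a generic, self-contained implementation, not the authors' evaluation code:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned item i to category j
    (e.g. approve/reject per question). Every row must sum to the same
    number of raters n.
    """
    N = len(counts)            # number of rated items
    n = sum(counts[0])         # raters per item
    k = len(counts[0])         # number of categories
    total = N * n

    # Proportion of all assignments falling into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(k)]

    # Per-item pairwise agreement among the n raters.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

    P_bar = sum(P_i) / N               # observed agreement
    P_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

With five raters per question (one expert plus four students, as in the paper), each row of `counts` would sum to 5; κ = 0.82 falls in the "almost perfect" band of the usual Landis–Koch interpretation.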
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance degrades significantly in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency-check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach that ensures reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and was validated by one medical expert and four students, achieving 94.2% approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
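The two automated filters described above, dual-model answer agreement and substring-level evidence checking, can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' pipeline; the function names, the option-dictionary format, and the `model_a`/`model_b` callables are all hypothetical:

```python
def consistent_answer(answer_a: str, answer_b: str) -> bool:
    """Dual-model check: keep a question only if two independently
    prompted models select the same option letter."""
    return answer_a.strip().upper() == answer_b.strip().upper()


def evidence_supported(answer_text: str, passage: str) -> bool:
    """Substring-level evidence check: the answer string must appear
    verbatim in the retrieved passage. A known limitation (noted in
    the abstract): paraphrased evidence is missed."""
    return answer_text.lower() in passage.lower()


def keep_question(question, passage, model_a, model_b) -> bool:
    """Hypothetical combined filter: a question survives only if both
    automated checks pass; survivors then go to expert review."""
    a = model_a(question)  # e.g. "A"
    b = model_b(question)
    if not consistent_answer(a, b):
        return False
    return evidence_supported(question["options"][a.strip().upper()], passage)
```

A substring match is a deliberately conservative evidence criterion: it can reject valid paraphrases, but any question it accepts is guaranteed to be literally grounded in the retrieved text, which shifts the residual burden onto the human reviewers.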