VM14K: First Vietnamese Medical Benchmark

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of verifiable, standardized medical evaluation benchmarks for low-resource, non-English-speaking communities—particularly Vietnam—this work introduces VM14K, the first Vietnamese-language medical multiple-choice question benchmark. VM14K comprises 14,000 clinically grounded questions spanning 34 medical specialties, stratified into four difficulty levels. The dataset integrates items from Vietnamese medical examinations and anonymized clinical records, with rigorous annotation by domain experts and multi-source cross-validation. The authors propose an extensible, open-source benchmark construction framework and a principled difficulty modeling methodology. VM14K is released in three parts: a 4k public sample set, a 10k full public set, and a 2k private set used for leaderboard evaluation—enabling comprehensive model evaluation and fair, reproducible comparison. This benchmark fills a critical gap in medical AI evaluation for low-resource languages and establishes a transparent, reproducible foundation for developing and assessing Vietnamese large language models in healthcare.

📝 Abstract
Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities, thereby helping ensure the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmarks, and available non-English medical data is often fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed from various verifiable sources, including carefully curated medical exams and clinical records, and was annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models' medical understanding in the target language, thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set covers all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.

Problem

Research questions and friction points this paper is trying to address.

Lack of Vietnamese medical benchmarks for evaluating language models
Fragmented and unverified non-English medical data sources
Need for scalable method to create multilingual medical benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vietnamese medical benchmark with 14k questions
Diverse verifiable sources and expert annotations
Scalable pipeline for multilingual medical benchmarks
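The paper's evaluation setup—multiple-choice questions stratified by specialty and difficulty—can be sketched as a small scoring routine. Everything below is a hedged illustration: the record schema (`specialty`, `difficulty`, `options`, `answer`) and the placeholder questions are assumptions for demonstration, not the released VM14K format.

```python
# Sketch of per-difficulty accuracy scoring for a VM14K-style
# multiple-choice benchmark. Field names are illustrative assumptions,
# not the actual released data schema.
from collections import defaultdict

def score_by_difficulty(questions, predict):
    """Return accuracy per difficulty level for a prediction function."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["difficulty"]] += 1
        if predict(q) == q["answer"]:
            correct[q["difficulty"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Tiny placeholder split (contents are not real benchmark items).
sample = [
    {"specialty": "cardiology", "difficulty": 1,
     "options": ["A", "B", "C", "D"], "answer": "A"},
    {"specialty": "cardiology", "difficulty": 4,
     "options": ["A", "B", "C", "D"], "answer": "C"},
    {"specialty": "pediatrics", "difficulty": 1,
     "options": ["A", "B", "C", "D"], "answer": "B"},
]

# Trivial baseline: always pick the first option.
always_first = lambda q: q["options"][0]
print(score_by_difficulty(sample, always_first))
# → {1: 0.5, 4: 0.0}
```

Reporting accuracy per difficulty level, as here, is what lets a benchmark like this separate breadth (foundational knowledge) from depth (clinical-reasoning cases).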
👥 Authors
Thong Nguyen — Vietnam National University
Duc Nguyen — Dickinson College, Computer Science
Minh Dang — Columbia University
Thai Dao — Venera AI
Long Nguyen — Graduate Student, Carnegie Mellon University (biological and biomedical sciences, digital pathology, computational microscopy)
Quan H. Nguyen — University of Maryland
Dat Nguyen — Postdoc, Harvard, Basis Institute (Graph Neural Networks, Program Analysis, Software Engineering, Program Synthesis, Computer Vision)
Kien Tran — Venera AI
Minh-Nam Tran — Foreign Trade University