🤖 AI Summary
A high-quality, subject-specific multiple-choice question (MCQ) benchmark for Vietnamese education, particularly in mathematics, physics, chemistry, and biology, is lacking; the existing ViMMRC 1.0 and ViMMRC 2.0 datasets cover only literature. Method: We introduce the first rigorously LaTeX-formatted Vietnamese MCQ dataset for STEM disciplines, designed to evaluate large language models' (LLMs') ability to directly bind answer options (A/B/C/D) under zero-, one-, and few-shot settings. We propose a lightweight evaluation paradigm: extracting character-level answers from context-aware token probabilities, bypassing step-by-step reasoning and reducing computational overhead. We also provide the first structured LaTeX annotation guideline and a cross-disciplinary MCQA benchmark for Vietnamese STEM subjects. Contribution/Results: We systematically evaluate six mainstream LLMs, including BLOOMZ, LLaMA-2, and the GPT series, on ViMMRC 1.0, ViMMRC 2.0, and the proposed dataset; GPT-4 achieves the top performance. The dataset is publicly released, addressing a critical gap in Vietnamese-language AI-driven educational assessment.
📝 Abstract
In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has used the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT; however, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas in mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We determine the most probable answer character (A, B, C, or D) from the context, instead of deriving the answer step by step as in previous Vietnamese work. This reduces computational cost and accelerates the evaluation of LLMs. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and on our proposed dataset shows promising results for the MCSB ability of LLMs in Vietnamese. The dataset is available for research purposes only.
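The symbol-binding evaluation described above can be sketched in a few lines: given the model's log-probabilities for the next token after a prompt ending in something like "Answer:", renormalize over the four option letters and take the argmax. This is a minimal illustrative sketch, not the paper's actual code; the function name and the toy log-probability values are hypothetical.

```python
import math

def bind_answer_symbol(option_logprobs):
    """Pick the most probable answer letter from next-token log-probabilities.

    option_logprobs: dict mapping candidate letters ("A".."D") to the model's
    log-probability of emitting that letter as the next token.
    Returns (best_letter, probabilities renormalized over the candidates).
    """
    # Softmax restricted to the option letters (shift by the max for stability).
    max_lp = max(option_logprobs.values())
    exps = {k: math.exp(v - max_lp) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical log-probs a model might assign after a prompt ending "Answer:".
logprobs = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9}
letter, probs = bind_answer_symbol(logprobs)
print(letter)  # B
```

Because only one forward pass over the prompt is needed per question, this is far cheaper than generating and parsing a full chain-of-thought solution.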