Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

📅 2023-10-18
🏛️ Symposium on Information and Communication Technology
📈 Citations: 4
Influential: 1
🤖 AI Summary
A high-quality, subject-specific multiple-choice question (MCQ) benchmark for Vietnamese education, particularly in mathematics, physics, chemistry, and biology, is lacking. Method: The authors introduce the first rigorously LaTeX-formatted Vietnamese MCQ dataset for these STEM disciplines, designed to evaluate large language models' (LLMs') ability to directly bind answer options (A/B/C/D) under zero-, one-, and few-shot settings. They propose a lightweight evaluation paradigm: extracting character-level answers from context-aware token probabilities, bypassing step-by-step reasoning and reducing computational overhead. They also provide a structured LaTeX annotation guideline and a cross-disciplinary MCQA benchmark. Contribution/Results: Six mainstream LLMs (BLOOMZ, LLaMA-2, and the GPT series) are systematically evaluated on the existing ViMMRC 1.0 and ViMMRC 2.0 benchmarks and the new dataset; GPT-4 achieves top performance. The dataset is publicly released, addressing a critical gap in Vietnamese-language AI-driven educational assessment.
📝 Abstract
In this paper, we evaluate the ability of large language models (LLMs) to perform multiple-choice symbol binding (MCSB) for multiple-choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, which has fewer challenging MCQA datasets than English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has focused on the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly examined how ChatGPT solves the VNHSGE step by step. We aim to create a novel, high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. Because it is typed in a strict LaTeX style, this dataset can be used to evaluate the MCSB ability of both LLMs and smaller language models (LMs). We determine the most probable character answer (A, B, C, or D) based on context, instead of finding the answer step by step as in previous Vietnamese works. This reduces computational costs and accelerates the evaluation of LLMs. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
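The symbol-binding evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the log-probabilities below are made up, standing in for the scores a model would assign to the tokens "A"-"D" immediately after a prompt ending in something like "Answer:".

```python
def predict_answer(letter_logprobs):
    """Multiple-choice symbol binding: pick the answer letter (A-D) to
    which the model assigns the highest next-token log-probability,
    with no step-by-step reasoning or answer-text generation."""
    return max(letter_logprobs, key=letter_logprobs.get)

def mcsb_accuracy(examples):
    """examples: list of (letter_logprobs, gold_letter) pairs."""
    correct = sum(predict_answer(lp) == gold for lp, gold in examples)
    return correct / len(examples)

# Hypothetical per-question log-probabilities for the four answer symbols.
examples = [
    ({"A": -2.1, "B": -0.4, "C": -3.0, "D": -1.8}, "B"),  # model correct
    ({"A": -0.9, "B": -1.5, "C": -0.7, "D": -2.2}, "C"),  # model correct
    ({"A": -1.2, "B": -1.1, "C": -0.8, "D": -1.9}, "D"),  # model wrong
]
print(mcsb_accuracy(examples))  # → 0.666...
```

Because each question costs only one forward pass (a single next-token distribution) rather than a generated reasoning chain, this paradigm is cheap enough to run across many models and shot settings, which is the efficiency argument the abstract makes.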
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' symbol binding for Vietnamese multiple-choice questions
Creating a LaTeX-structured dataset for math and science MCQA
Assessing LLMs' ability to predict correct answer characters (A-D)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs on Vietnamese multiple-choice questions
Creates LaTeX-structured dataset for STEM subjects
Tests six LLMs on symbol binding ability
Duc-Vu Nguyen
University of Information Technology
Natural Language Processing
Quoc-Nam Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam