Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks lack systematic evaluation of tri-modal (image/audio/text) understanding in Traditional Chinese and neglect inference latency. Method: We introduce Multi-TW, the first MLLM benchmark tailored to Traditional Chinese as used in Taiwan, built from official SC-TOP language-proficiency tests to yield 900 image-text and audio-text multiple-choice questions. It enables joint evaluation of accuracy and inference latency for any-to-any multimodal models. Contributions/Results: (1) the first unified tri-modal (visual, auditory, textual) evaluation framework for Traditional Chinese; (2) an inference-latency metric integrated into the benchmark; (3) a comparison of end-to-end any-to-any models against vision-language models paired with separate speech-to-text baselines. Experiments show that closed-source models achieve higher overall accuracy, while open-source models can still perform well on audio-centric tasks; end-to-end pipelines show clear latency advantages over cascaded transcription, underscoring the need for Traditional Chinese fine-tuning and efficient multimodal architectures.

📝 Abstract
Multimodal Large Language Models (MLLMs) process visual, acoustic, and textual inputs, addressing the limitations of single-modality LLMs. However, existing benchmarks often overlook tri-modal evaluation in Traditional Chinese and do not consider inference latency. To address this, we introduce Multi-TW, the first Traditional Chinese benchmark for evaluating the performance and latency of any-to-any multimodal models. Multi-TW includes 900 multiple-choice questions (image-text and audio-text pairs) sourced from official proficiency tests developed with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP). We evaluated various any-to-any models and vision-language models (VLMs) with audio transcription. Our results show that closed-source models generally outperform open-source ones across modalities, although open-source models can perform well in audio tasks. End-to-end any-to-any pipelines offer clear latency advantages over VLMs that rely on separate audio transcription. Multi-TW presents a comprehensive view of model capabilities and highlights the need for Traditional Chinese fine-tuning and efficient multimodal architectures.
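
To make the evaluation protocol concrete, the following is a minimal sketch of a joint accuracy-and-latency scoring loop over Multi-TW-style multiple-choice items. The item schema and the `model.answer(...)` interface are illustrative assumptions rather than the authors' released code; timing wraps the entire model call so that any internal transcription stage counts toward measured latency.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    question: str                      # Traditional Chinese question stem
    choices: list[str]                 # multiple-choice options
    answer: str                        # gold choice label, e.g. "A"
    image_path: Optional[str] = None   # set for image-text items
    audio_path: Optional[str] = None   # set for audio-text items

def evaluate(model, items: list[Item]) -> dict:
    """Score accuracy and mean per-item latency in a single pass."""
    correct, latencies = 0, []
    for item in items:
        start = time.perf_counter()
        pred = model.answer(           # hypothetical any-to-any interface
            question=item.question,
            choices=item.choices,
            image=item.image_path,
            audio=item.audio_path,
        )
        latencies.append(time.perf_counter() - start)
        correct += int(pred == item.answer)
    return {
        "accuracy": correct / len(items),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```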
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal models in Traditional Chinese contexts
Assessing performance and latency of any-to-any multimodal models
Addressing lack of tri-modal benchmarks for Traditional Chinese
Innovation

Methods, ideas, or system contributions that make the work stand out.

First tri-modal (image/audio/text) benchmark for Traditional Chinese
Evaluates accuracy and inference latency together
Uses end-to-end any-to-any pipelines, contrasted with cascaded transcription baselines (see the pipeline sketch below)
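
The latency finding follows from the pipeline structure itself: a cascaded system pays for speech-to-text before the VLM ever sees the question. Below is a minimal sketch of both pipeline styles, assuming placeholder `answer` and `asr.transcribe` interfaces rather than any specific library's API.

```python
import time

def answer_end_to_end(model, question, choices, audio_path):
    """Single forward pass: the model ingests raw audio plus text."""
    start = time.perf_counter()
    pred = model.answer(question=question, choices=choices, audio=audio_path)
    return pred, time.perf_counter() - start

def answer_cascaded(asr, vlm, question, choices, audio_path):
    """Two stages: ASR runs first, so its latency adds to the VLM call."""
    start = time.perf_counter()
    transcript = asr.transcribe(audio_path)   # stage 1: speech-to-text
    prompt = f"{transcript}\n{question}"      # stage 2: feed transcript to a VLM
    pred = vlm.answer(question=prompt, choices=choices)
    return pred, time.perf_counter() - start
```

Measured this way, the end-to-end path skips the transcription stage entirely, which is the latency advantage the paper reports.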
👥 Authors

Jui-Ming Yao
National Taiwan University of Science and Technology, Taipei, Taiwan
Bing-Cheng Xie
National Taiwan University of Science and Technology, Taipei, Taiwan
Sheng-Wei Peng
National Taiwan University of Science and Technology, Taipei, Taiwan
Hao-Yuan Chen
University of London; Mindify AI
He-Rong Zheng
National Taiwan University, Taipei, Taiwan
Bing-Jia Tan
National Taiwan University, Taipei, Taiwan
Peter Shaojui Wang
National Taiwan University of Science and Technology, Taipei, Taiwan
Shun-Feng Su
National Taiwan University of Science and Technology, Taipei, Taiwan