SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

📅 2025-09-24
🤖 AI Summary
Problem: Low-resource Sinitic languages such as Cantonese and Wu Chinese lack fine-grained, error-annotated machine translation (MT) evaluation datasets. Method: The authors build on existing parallel corpora to construct a dataset of machine-translated examples from English into Mandarin, Cantonese, and Wu, with three tiers of human annotation: error span, error type (e.g., grammatical, semantic, dialect-specific), and severity level. Annotation is performed by native speakers and checked through iterative feedback and inter-annotator agreement analysis. Contribution/Results: The dataset addresses a gap in MT evaluation for low-resource Sinitic languages and supports analysis of patterns in error type and severity across the three target languages. It is intended as a resource for fine-tuning models with error detection capabilities and for research on translation quality estimation, error-aware generation, and low-resource language evaluation.

📝 Abstract
Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.
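The abstract mentions an inter-annotator agreement analysis but does not say which statistic is used. As a minimal sketch, assuming two annotators each assign one severity label per example, Cohen's kappa is one common choice. The function name `cohen_kappa` and the label values below are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one categorical label per item per annotator. The paper's
    actual agreement statistic is not stated in this summary.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical severity labels, invented for illustration only.
annotator_1 = ["minor", "major", "minor", "none"]
annotator_2 = ["minor", "major", "major", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for instance, "no error") dominates the data.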
Problem

Research questions and friction points this paper is trying to address.

Addressing limited machine translation progress for low-resource Sinitic languages
Providing error annotations for English-to-Sinitic machine translation outputs
Creating resources for error detection and quality estimation in MT systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset with error span and type annotations
Supports fine-tuning models for error detection
Focuses on low-resource Sinitic language evaluation
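
The three annotation layers (error span, error type, severity) can be pictured as a small record per machine-translated sentence. The sketch below is one possible layout, not the dataset's actual schema: the class names, field names, label strings, and the example sentence are all hypothetical. It also shows how such spans could be expanded into per-character tags, a typical input format when fine-tuning an error detection model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ErrorSpan:
    start: int       # character offset into the MT output (inclusive)
    end: int         # character offset into the MT output (exclusive)
    error_type: str  # hypothetical label, e.g. "semantic"
    severity: str    # hypothetical label, e.g. "major"


@dataclass(frozen=True)
class AnnotatedExample:
    source: str       # English source sentence
    target_lang: str  # hypothetical language code, e.g. "yue" for Cantonese
    mt_output: str    # machine-translated sentence being annotated
    errors: List[ErrorSpan]


def char_tags(example: AnnotatedExample) -> List[str]:
    """Expand span annotations into one tag per character of the MT output."""
    tags = ["OK"] * len(example.mt_output)
    for span in example.errors:
        if not (0 <= span.start < span.end <= len(example.mt_output)):
            raise ValueError(f"span out of range: {span}")
        for i in range(span.start, span.end):
            tags[i] = f"{span.error_type}:{span.severity}"
    return tags


# Invented example (not from the dataset): "cats" mistranslated as "dog".
example = AnnotatedExample(
    source="I like cats.",
    target_lang="yue",
    mt_output="我鍾意狗。",
    errors=[ErrorSpan(start=3, end=4, error_type="semantic", severity="major")],
)
tags = char_tags(example)
```

Character offsets are used here because Chinese text has no whitespace word boundaries; a real pipeline would map these offsets onto whatever tokenization the model uses.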
Authors

Hannah Liu (University of Toronto)
Junghyun Min (Georgetown University)
Ethan Yue Heng Cheung (University of Toronto)
Shou-Yi Hung (University of Toronto)
Syed Mekael Wasti (Ontario Tech University)
Runtong Liang (University of Toronto)
Shiyao Qian (University of Toronto)
Shizhao Zheng (University of Toronto)
Elsie Chan (University of Toronto)
Ka Ieng Charlotte Lo (University of Toronto)
Wing Yu Yip (University of Toronto)
Richard Tzong-Han Tsai (National Central University; Academia Sinica). Research areas: Natural Language Processing, Artificial Intelligence
En-Shiun Annie Lee (Ontario Tech University; University of Toronto, Status-Only). Research areas: Natural Language Processing, Data Mining, Pattern Analysis