VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

📅 2025-05-30
🤖 AI Summary
Low-resource code-mixed translation suffers from a scarcity of authentic code-mixed data and from the low ecological validity of synthetic data. To address this, we introduce VietMix, the first real-world Vietnamese–English code-mixed parallel corpus, and propose a rule- and large language model (LLM)-guided synthetic data generation framework that integrates multi-stage filtering and iterative enhancement. Our approach jointly targets linguistic plausibility, pragmatic naturalness, and controllable data properties, and it establishes a new evaluation benchmark for low-resource code-mixed machine translation. Experiments show that models trained with this data reach up to 71.84 on COMETkiwi and 81.77 on XCOMET. In LLM-based preference evaluation, the augmented models win 54–56% of pairwise comparisons (excluding ties) against their seed fine-tuned counterparts.

📝 Abstract
Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows that our naturalistic and complementary synthetic data boost model performance, as measured by translation quality estimation scores, to as high as 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating these positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
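The two preference figures in the abstract (about 49% of all judgments, 54-56% excluding ties) differ only in the denominator. The sketch below shows that arithmetic in Python. The function name `win_rates` and the judgment counts are hypothetical, chosen only for illustration; they are not the paper's data or evaluation code.

```python
def win_rates(wins: int, losses: int, ties: int) -> tuple[float, float]:
    """Return (win rate over all judgments, win rate excluding ties)."""
    total = wins + losses + ties   # every pairwise judgment
    decided = wins + losses        # judgments where one side was preferred
    if total == 0 or decided == 0:
        raise ValueError("need at least one decided judgment")
    return wins / total, wins / decided


# Hypothetical counts for illustration only (not taken from the paper).
overall, no_ties = win_rates(wins=49, losses=40, ties=11)
print(f"{overall:.1%} overall, {no_ties:.1%} excluding ties")
# prints "49.0% overall, 55.1% excluding ties"
```

Dropping ties shrinks the denominator, so a system favored in fewer than half of all judgments can still win a majority of the decided ones.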
Problem

Research questions and friction points this paper is trying to address.

Machine translation fails for code-mixed low-resource languages
Lack of parallel corpus for Vietnamese-English code-mixed text
Need for synthetic data to improve translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated parallel corpus of natural code-mixed text
Synthetic data generation with filtering mechanisms
Improved translation via natural and synthetic data
Hieu Tran
University of Maryland, College Park
Natural Language Processing, Large Language Models
Phuong-Anh Nguyen-Le
University of Maryland, College Park
Huy Nghiem
University of Maryland, College Park
Quang-Nhan Nguyen
Harvard University
Wei Ai
University of Maryland, College Park
Marine Carpuat
Associate Professor, Computer Science, University of Maryland
Natural Language Processing, Machine Translation