VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Low-resource code-mixed translation suffers from a scarcity of authentic code-mixed data and from the low ecological validity of synthetic data. To address this, we introduce VietMix, the first real-world Vietnamese–English code-mixed parallel corpus, and propose a rule- and large language model (LLM)-guided synthetic data generation framework that integrates multi-stage filtering and iterative augmentation. The approach targets linguistic plausibility, pragmatic naturalness, and controllable data properties, and provides an evaluation benchmark for low-resource code-mixed machine translation. In experiments, models trained on the naturalistic and synthetic data reach quality estimation scores of up to 71.84 on COMETkiwi and 81.77 on XCOMET. In LLM-based preference evaluation, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments, or 54-56% of pairwise comparisons once ties are excluded.

📝 Abstract

Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows our naturalistic and complementary synthetic data boost models' performance, measured by translation quality estimation scores, of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
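The two preference figures in the abstract (approximately 49% of all judgments, 54-56% excluding ties) describe the same set of comparisons with different denominators. The sketch below shows that arithmetic. The function names and the counts are hypothetical and chosen only for illustration; the abstract does not report raw win, loss, or tie counts.

```python
from typing import Tuple


def win_rates(wins: int, losses: int, ties: int) -> Tuple[float, float]:
    """Return (win rate over all judgments, win rate excluding ties)."""
    total = wins + losses + ties
    decided = wins + losses
    if total == 0 or decided == 0:
        raise ValueError("need at least one decided judgment")
    return wins / total, wins / decided


def implied_tie_share(rate_all: float, rate_no_ties: float) -> float:
    """Tie share t implied by rate_all = rate_no_ties * (1 - t)."""
    return 1.0 - rate_all / rate_no_ties


# Hypothetical counts per 100 judgments (not taken from the paper).
overall, decided_only = win_rates(wins=49, losses=40, ties=11)
print(f"all judgments: {overall:.2%}, excluding ties: {decided_only:.2%}")

# Tie share implied by the reported figures, under the assumption that
# both percentages refer to the same pool of judgments.
for no_ties in (0.54, 0.56):
    print(f"{no_ties:.0%} excluding ties -> tie share {implied_tie_share(0.49, no_ties):.1%}")
```

Under that assumption, the reported figures are consistent with roughly one judgment in ten being a tie, which is why the augmented models can be preferred in a majority of decided comparisons while winning just under half of all judgments.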

Problem

Research questions and friction points this paper is trying to address.

Machine translation fails for code-mixed low-resource languages

Lack of a parallel corpus for Vietnamese-English code-mixed text

Need for synthetic data to improve translation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated parallel corpus of natural code-mixed text

Synthetic data generation with filtering mechanisms

Improved translation via natural and synthetic data
