The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how the quality of synthetic code-switching data affects performance across machine translation (MT), automatic speech recognition (ASR), and cascaded speech translation (ST). Extending prior MT-focused work into a unified cross-task analysis, it reveals task-specific dependencies between data quality and model performance: high-quality synthetic data substantially improves MT; ASR is particularly sensitive to speech–language alignment fidelity; and low-quality augmentation degrades ST. Methodologically, the study combines multiple augmentation strategies, including lexical replacement, linguistically informed generation, and back-translation, and evaluates them on a common multi-task benchmark. Its core contribution is moving beyond single-task assessment: it empirically demonstrates that the efficacy of data augmentation is task dependent, yielding practical guidelines for synthetic data curation.

📝 Abstract
Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, the current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test the generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.
Problem

Research questions and friction points this paper is trying to address.

Investigates code-switched synthetic data quality impact on NLP tasks
Compares augmentation techniques for machine translation and speech recognition
Evaluates generalizability of findings across different language technologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses code-switched synthetic data augmentation
Tests MT, ASR, and ST for generalizability
Employs lexical, linguistic, and back-translation techniques
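Of the techniques listed above, lexical replacement is the simplest to illustrate: words in a matrix-language sentence are swapped for embedded-language translations to synthesize code-switched text. The sketch below is a minimal, hypothetical version using a toy English–Spanish lexicon and random substitution; the paper's actual pipelines (including the linguistically informed and back-translation variants) are more sophisticated.

```python
import random

# Toy bilingual lexicon (hypothetical entries); a real pipeline would
# induce one from aligned parallel data or a bilingual dictionary.
EN_ES_LEXICON = {
    "house": "casa",
    "book": "libro",
    "friend": "amigo",
    "water": "agua",
}

def lexical_switch(sentence, lexicon, rate=0.5, seed=None):
    """Replace a fraction of matrix-language words with embedded-language
    translations to produce a synthetic code-switched sentence."""
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        key = token.lower()
        # Substitute only dictionary words, each with probability `rate`.
        if key in lexicon and rng.random() < rate:
            out.append(lexicon[key])
        else:
            out.append(token)
    return " ".join(out)

print(lexical_switch("my friend read the book", EN_ES_LEXICON, rate=1.0))
# → my amigo read the libro
```

The `rate` knob is one crude proxy for the quality/quantity trade-off the paper studies: aggressive substitution yields more switching but less natural sentences, which is exactly the kind of quality variation whose downstream effect differs across MT, ASR, and ST.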