🤖 AI Summary
This work investigates the impact of cross-lingual transfer on euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yoruba, with an emphasis on the low-resource cases. To address the modeling challenges posed by cultural variation and semantic ambiguity in euphemisms, we propose a sequential fine-tuning strategy that transfers knowledge from a high-resource language (e.g., English) to low-resource ones. Using XLM-R and mBERT, we systematically compare monolingual fine-tuning, simultaneous multilingual fine-tuning, and sequential fine-tuning, analyzing the effects of language typology, pretraining coverage, and transfer paths. Results show that sequential fine-tuning significantly improves performance on low-resource languages, especially Yoruba, and reveals pretraining-data disparity as a key bottleneck. While XLM-R yields larger gains, it is more susceptible to catastrophic forgetting; mBERT is more robust. This study establishes an interpretable, reproducible transfer paradigm for implicit semantic understanding in low-resource settings.
📝 Abstract
Euphemisms are culturally variable and often ambiguous, posing challenges for language models, especially in low-resource settings. This paper investigates how cross-lingual transfer via sequential fine-tuning affects euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yoruba. We compare sequential fine-tuning with monolingual and simultaneous fine-tuning using XLM-R and mBERT, analyzing how performance is shaped by language pairings, typological features, and pretraining coverage. Results show that sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages like Yoruba and Turkish. XLM-R achieves larger gains but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT yields more stable, though lower, results. These findings highlight sequential fine-tuning as a simple yet effective strategy for improving euphemism detection in multilingual models, particularly when low-resource languages are involved.
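The three fine-tuning regimes compared above differ only in the order and mixing of the training data. A minimal sketch of that difference follows; the function names and the toy "model" (a record of what data was seen) are illustrative placeholders, not the authors' implementation — in practice the model would be XLM-R or mBERT updated by gradient descent at each step.

```python
# Toy sketch of the three fine-tuning regimes. The "model" is just a list
# of the training data it has seen, standing in for XLM-R/mBERT weights.

def fine_tune(model, language_data):
    """Placeholder for one fine-tuning pass: record the data the model saw."""
    return model + [language_data]

def monolingual(l2_data):
    # Fine-tune only on the target (L2) language.
    return fine_tune([], l2_data)

def simultaneous(l1_data, l2_data):
    # Fine-tune once on the mixed L1+L2 data.
    return fine_tune([], l1_data + "+" + l2_data)

def sequential(l1_data, l2_data):
    # First fine-tune on high-resource L1, then continue on low-resource L2;
    # the second pass starts from the L1-adapted model, which is where the
    # transfer gains (and the risk of catastrophic forgetting) come from.
    model = fine_tune([], l1_data)
    return fine_tune(model, l2_data)

print(monolingual("yoruba"))              # ['yoruba']
print(simultaneous("english", "yoruba"))  # ['english+yoruba']
print(sequential("english", "yoruba"))    # ['english', 'yoruba']
```

The sketch makes the paper's central comparison explicit: simultaneous fine-tuning sees one mixed dataset, while sequential fine-tuning chains two passes so that L2 training inherits representations shaped by the high-resource L1.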