🤖 AI Summary
This work addresses the tendency of multilingual encoders to produce representations biased toward dominant languages such as English when processing code-mixed text, leaving code-mixed inputs poorly aligned semantically with their constituent languages. To investigate this issue, the authors construct a parallel corpus of English, Hindi (Devanagari), and Romanized Hindi-English code-mixed sentences, and use centered kernel alignment (CKA) for representation similarity analysis, together with token-level saliency and entropy-based uncertainty measures, to reveal representational asymmetries. They propose a trilingual post-training alignment objective that pulls code-mixed representations toward both constituent languages while preserving the existing bilingual English-Hindi alignment. Evaluated on sentiment analysis and hate speech detection, the approach significantly improves downstream performance, demonstrating the effectiveness of anchoring code-mixed representations to their constituent languages.
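The CKA analysis mentioned above can be sketched concretely. Below is a minimal implementation of *linear* CKA between two sets of sentence representations; the paper may use a kernel or minibatch variant, and the variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2) -- e.g. encoder activations
    of the same n parallel sentences in two languages. Returns a value
    in [0, 1]; 1 means the representations are identical up to an
    orthogonal transform and isotropic scaling.
    """
    # Center each feature dimension (CKA is defined on centered features).
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross-similarity, normalized by the self-similarities.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(X := X, ord=None) if False else np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)
```

In an analysis like the paper's, `X` and `Y` would hold, say, English and code-mixed sentence embeddings from a given layer, and the CKA score quantifies how aligned the two languages' representation spaces are at that layer.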
📝 Abstract
Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
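The abstract does not spell out the exact form of the trilingual post-training objective. A plausible sketch, assuming in-batch InfoNCE terms over parallel sentence embeddings (the function names, loss composition, and temperature value are illustrative assumptions, not the paper's stated method):

```python
import numpy as np

def info_nce(a, b, temperature=0.05):
    """In-batch contrastive loss: row i of `a` should match row i of `b`
    against all other rows in the batch. a, b: (batch, dim)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature
    # Numerically stable log-softmax; the diagonal entries are the positives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def trilingual_alignment_loss(cm, en, hi):
    """Pull code-mixed embeddings (cm) toward both English (en) and
    Hindi (hi) parallels while also preserving English-Hindi alignment."""
    return info_nce(cm, en) + info_nce(cm, hi) + info_nce(en, hi)
```

The key design idea the abstract implies is the symmetric treatment of both constituent languages: the code-mixed view is anchored to English *and* Hindi simultaneously, with a third term retaining the bilingual alignment that continued pre-training on code-mixed data alone was found to erode.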