CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

📅 2025-05-19
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Code-switching in multilingual text poses significant challenges for standard language models due to frequent language alternations. To address this, we propose a dual-decoder Transformer architecture featuring a shared encoder for joint multilingual representation and synchronized cross-decoder cross-attention to explicitly model language-switching dynamics and cross-lingual semantic alignment. We introduce a novel switch-point-aware pretraining objective that jointly optimizes translation reconstruction, structural consistency constraints, and fine-grained switch-position prediction. On the HASOC-2021 benchmark, our model improves F1 score, precision, and accuracy over competitive baselines under select pre-training setups. Attention visualization confirms accurate localization of code-switch points. This work is the first to incorporate switch-point identification directly into the pretraining objective, establishing a new, interpretable, and robust paradigm for code-switching modeling.

📝 Abstract
Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations, using multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.
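The abstract describes pre-training that combines several objectives (translation reconstruction, structural consistency, and switch-position prediction). A minimal sketch of how such a multi-task loss might be combined is below; the function name, weights, and weighting scheme are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of combining CMLFormer-style multi-task
# pre-training losses as a weighted sum. The paper does not specify
# its exact weighting; all names and defaults here are illustrative.
def total_pretraining_loss(l_translation, l_structure, l_switch,
                           w_t=1.0, w_s=1.0, w_p=1.0):
    """Weighted sum of the three switch-point-aware objectives."""
    return w_t * l_translation + w_s * l_structure + w_p * l_switch

print(total_pretraining_loss(1.0, 0.5, 0.25))  # -> 1.75
```

In practice each term would be computed per batch (e.g. cross-entropy for switch-position prediction) before being combined this way.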
Problem

Research questions and friction points this paper is trying to address.

Addresses structural challenges in code-mixed language modeling
Models linguistic dynamics of code-mixed text with dual-decoder Transformer
Improves identification of language switching points and cross-lingual semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-decoder Transformer with shared encoder
Pre-trained with switching point annotations
Multi-task objectives for code-mixing complexity
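To make the switching-point annotations concrete, here is a minimal sketch of deriving binary switch-point labels from token-level language tags, as a switch-position prediction objective would require. The paper does not release this code; the function name and tag scheme are hypothetical.

```python
# Hypothetical sketch: derive binary switching-point labels from
# token-level language tags ("hi" = Hindi, "en" = English).
# Illustrative only; not code from the paper.

def switch_point_labels(lang_tags):
    """Label a token 1 if its language differs from the previous token's."""
    labels = [0]  # the first token can never be a switch point
    for prev, curr in zip(lang_tags, lang_tags[1:]):
        labels.append(1 if curr != prev else 0)
    return labels

# Hinglish example: "yeh movie was really acchi"
tags = ["hi", "hi", "en", "en", "hi"]
print(switch_point_labels(tags))  # -> [0, 0, 1, 0, 1]
```

Labels like these could serve as targets for a token-level classification head during pre-training.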
Aditeya Baral
New York University
Natural Language Processing · Deep Learning · Representation Learning · Computational Linguistics

Allen George Ajith
Courant Institute of Mathematical Sciences, New York University

Roshan Nayak
RND Engineer @Synopsys Inc
Deep Learning · NLP

Mrityunjay Abhijeet Bhanja
Tandon School of Engineering, New York University