Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

📅 2025-10-08
🤖 AI Summary
Code-switching (CSW), the intra-sentential mixing of languages and scripts, remains a critical bottleneck for deploying large language models (LLMs) in multilingual societies, manifesting as poor comprehension of mixed inputs, scarce training data, and biased evaluation. This paper presents the first systematic survey of CSW-NLP research in the LLM era, covering five research directions, twelve tasks, and more than thirty datasets. It proposes a tripartite development framework: inclusive data construction, fair evaluation protocols, and linguistics-informed modeling. Alongside an analysis of model architectures, a taxonomy of training strategies, and a synthesis of evaluation methodologies, the authors release the first open-source CSW resource repository, covering 80+ languages. Empirical analysis reveals significant limitations of mainstream LLMs in cross-lingual semantic alignment and syntactic consistency. The work provides both theoretical foundations and practical benchmarks for trustworthy, culturally aware multilingual AI.

📝 Abstract
Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.
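To make "alternation of languages within a single utterance" concrete, the CSW literature commonly quantifies mixing with the Code-Mixing Index (CMI) of Das and Gambäck (2014), which measures how far an utterance departs from being monolingual. The sketch below is an illustrative Python implementation of that standard formula, not code from this paper; the per-token language tags (including a hypothetical "univ" label for language-independent tokens such as punctuation) are assumed to come from an upstream token-level language identifier.

```python
from collections import Counter

def code_mixing_index(lang_tags):
    """Code-Mixing Index (Das & Gambäck, 2014): CMI = 100 * (1 - max_i(w_i) / (n - u)).

    lang_tags: per-token language labels; "univ" marks language-independent
    tokens. Returns 0.0 for monolingual (or fully language-independent)
    utterances and approaches 100 as mixing becomes more balanced.
    """
    n = len(lang_tags)
    counts = Counter(t for t in lang_tags if t != "univ")  # w_i per language
    u = n - sum(counts.values())                           # language-independent tokens
    if n == u:                                             # no language-tagged tokens
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# English-Hindi example: "I went to the bazaar kal shaam"
print(code_mixing_index(["en", "en", "en", "en", "hi", "hi", "hi"]))  # ~42.86
```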
Problem

Research questions and friction points this paper is trying to address.

Addressing code-switching challenges in multilingual NLP with large language models
Overcoming limited datasets and evaluation biases in mixed-language processing
Developing linguistically grounded models for true multilingual intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveyed code-switched NLP with large language models
Classified advances by architecture and training strategy
Proposed roadmap for inclusive datasets and evaluation
Rajvee Sheth
IIT Gandhinagar, LINGO Research Group
Samridhi Raj Sinha
NMIMS Mumbai, LINGO Research Group
Mahavir Patil
SVNIT Surat, LINGO Research Group
Himanshu Beniwal
Indian Institute of Technology Gandhinagar
Natural Language Processing · Machine Learning · Computational Linguistics · Deep Learning
Mayank Singh
IIT Gandhinagar, LINGO Research Group