🤖 AI Summary
Existing Hindi-English code-switching datasets largely rely on Romanized text or synthetic data, and they suffer from narrow coverage, low naturalness, and a lack of human-annotated evaluation. To address these limitations, we introduce COMI-LINGUA, the largest manually annotated Hindi-English code-switching dataset, comprising 100,970 authentic social media utterances in both Devanagari and Roman scripts. The dataset supports five core NLP tasks: language identification, matrix (dominant) language identification, part-of-speech tagging, named entity recognition, and machine translation. We propose a three-expert collaborative annotation protocol that incorporates iterative consensus building and rigorous quality control, enabling fine-grained human assessment of code-switching naturalness and acceptability. Annotation guidelines are harmonized across tasks for downstream compatibility, and the dataset is publicly released on Hugging Face. Empirical evaluation reveals substantial performance bottlenecks of state-of-the-art multilingual LMs on this benchmark.
📝 Abstract
The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.
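To make the mixed-script setting concrete, the following is a minimal sketch (not part of COMI-LINGUA or its tooling) that labels each whitespace-separated token by script, using the Unicode Devanagari block (U+0900 to U+097F). The function names `token_script` and `script_profile` and the example sentence are illustrative assumptions. Note that script is not language: a romanized Hindi word such as "kal" is labeled `roman` here, which is one reason token-level language identification needs human annotation rather than a script heuristic.

```python
from collections import Counter


def token_script(token: str) -> str:
    """Label a token by the script of its characters (hypothetical helper)."""
    # Devanagari occupies the Unicode block U+0900..U+097F.
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in token)
    # Treat ASCII letters as Roman script.
    has_roman = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_roman:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_roman:
        return "roman"
    return "other"  # digits, punctuation, emoji, other scripts


def script_profile(sentence: str) -> Counter:
    """Count tokens per script in a whitespace-tokenized sentence."""
    return Counter(token_script(tok) for tok in sentence.split())


if __name__ == "__main__":
    # Illustrative code-mixed sentence: two Devanagari and two Roman tokens.
    print(script_profile("मैं kal office जाऊंगा"))
```

A profile like this only describes surface script. Tasks such as Language Identification and Matrix Language Identification require per-token and per-sentence language judgments that a script check cannot supply.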