COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

📅 2025-03-27
🤖 AI Summary
Existing Hindi-English code-mixing datasets largely rely on Romanized text or synthetic data, and suffer from narrow coverage, low naturalness, and a lack of human-annotated evaluation. To address these limitations, the authors introduce COMI-LINGUA, a large-scale, human-annotated Hindi-English code-mixed dataset comprising 100,970 instances in both Devanagari and Roman scripts. The dataset supports five core NLP tasks: language identification, matrix (dominant) language identification, part-of-speech tagging, named entity recognition, and machine translation. Three expert annotators evaluated the instances under a collaborative protocol with iterative consensus building and quality control, which allows human assessment of the naturalness and acceptability of the code-mixed text. Annotation guidelines are harmonized across tasks for downstream compatibility, and the dataset is publicly released on Hugging Face. Empirical evaluation reveals substantial performance bottlenecks of state-of-the-art multilingual LLMs on this benchmark.

📝 Abstract
The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.
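
As a rough illustration of two of the tasks named in the abstract, the Python sketch below scores token-level language identification and derives a sentence-level dominant language from the token tags. Everything in it is an assumption made for illustration: the tag names ("hi", "en", "ot"), the helper functions, and the example sentence are hypothetical and are not taken from COMI-LINGUA or its annotation guidelines. The majority-count rule is only a crude stand-in for matrix language identification, which in linguistic terms depends on the grammatical frame of the sentence rather than on token counts.

```python
from collections import Counter

# Hypothetical tag set: "hi" = Hindi, "en" = English, "ot" = other.
LANG_TAGS = ("hi", "en")


def dominant_language(tags):
    """Return the more frequent of "hi"/"en", or "tie"/"none".

    Simple majority heuristic (an assumption), not the dataset's labeling rule.
    """
    counts = Counter(t for t in tags if t in LANG_TAGS)
    if not counts:
        return "none"
    if counts["hi"] == counts["en"]:
        return "tie"
    return counts.most_common(1)[0][0]


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Made-up Roman-script example (not drawn from the dataset).
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
gold_tags = ["hi", "hi", "en", "hi", "hi", "hi"]
pred_tags = ["hi", "hi", "hi", "hi", "hi", "hi"]  # a model that misses "movie"

print(dominant_language(gold_tags))
print(round(token_accuracy(gold_tags, pred_tags), 3))
```

A model that tags every token as Hindi still scores well on this sentence, which is one reason per-class metrics are usually reported alongside plain accuracy for code-mixed language identification.
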
Problem

Research questions and friction points this paper is trying to address.

Lack of a large-scale, manually annotated dataset for Hindi-English code-mixed text
Insufficient capture of real-world language nuances in existing datasets
Limited multilingual model performance on code-mixed NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest manually annotated Hindi-English code-mixed dataset
Supports five fundamental NLP tasks simultaneously
Evaluates LLMs on real-world code-mixed text processing
Rajvee Sheth
LINGO, Indian Institute of Technology Gandhinagar, India
Himanshu Beniwal
Indian Institute of Technology Gandhinagar
Natural Language Processing, Machine Learning, Computational Linguistics, Deep Learning
Mayank Singh
LINGO, Indian Institute of Technology Gandhinagar, India