Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

📅 2025-05-31
🤖 AI Summary
Problem: Publicly available, author-labeled, privacy-compliant corpora of code-mixed conversations are lacking, which hinders computational and sociolinguistic research on informal settings such as social media and instant messaging. Method: The authors build a labeled, general-purpose corpus for studying code-mixing in context, focused primarily on English and Mandarin alongside other languages and comprising more than 355,000 authentic chat messages to date. Collection is ongoing: messages are continuously gathered, verified, and integrated into a structured JSON release accompanied by detailed metadata and linguistic statistics. Contribution/Results: The resource fills a gap in informal multilingual interaction data and is intended as a foundational dataset for code-mixing research in computational linguistics, sociolinguistics, and NLP applications.

📝 Abstract
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.

Problem

Research questions and friction points this paper is trying to address.

Lack of labeled corpora for code-mixed chat analysis
Need for privacy-compliant multilingual conversation datasets
Modeling diverse code-mixing patterns in NLP research

Innovation

Methods, ideas, or system contributions that make the work stand out.

First labeled general-purpose codemixed corpus
Live project with continuous data integration
JSON dataset with metadata and statistics
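
As a rough illustration of how a JSON release like this might be consumed, the sketch below parses a few made-up message records and flags those that mix Latin-script and Chinese-character text. The field names (`id`, `speaker`, `text`) and the sample messages are hypothetical and are not taken from the corpus, whose actual schema is not described here. Script detection is also only a crude proxy for code-mixing: it misses romanized Mandarin and mixing between languages that share a script.

```python
import json
import unicodedata

# Hypothetical records: the field names and messages below are illustrative
# assumptions, not the corpus's published schema or data.
SAMPLE_JSON = """
[
  {"id": 1, "speaker": "A", "text": "ok lah see you tomorrow"},
  {"id": 2, "speaker": "B", "text": "我们 go eat 晚饭 later"},
  {"id": 3, "speaker": "A", "text": "好的"}
]
"""


def script_counts(text):
    """Count alphabetic characters by script: Latin letters vs. CJK ideographs."""
    counts = {"latin": 0, "cjk": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, emoji
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["latin"] += 1
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            counts["cjk"] += 1
    return counts


def is_code_mixed(text):
    """Crude proxy: a message is 'mixed' if it contains both scripts."""
    counts = script_counts(text)
    return counts["latin"] > 0 and counts["cjk"] > 0


messages = json.loads(SAMPLE_JSON)
mixed_ids = [m["id"] for m in messages if is_code_mixed(m["text"])]
mixed_share = len(mixed_ids) / len(messages)
```

On the sample above, only the second message contains both scripts, so it is the only one flagged. A real analysis would more likely rely on the corpus's own labels and metadata than on a script heuristic like this one.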
S. Churina
Centre for Trusted Internet & Community, National University of Singapore, Singapore
Akshat Gupta
UC Berkeley
Knowledge Editing, Natural Language Processing, Spoken Language Modeling
Insyirah Mujtahid
Centre for Trusted Internet & Community, National University of Singapore, Singapore
Kokil Jaidka
Associate Professor, National University of Singapore
social media, computational social science, computational psychology, affordances