🤖 AI Summary
Problem: No publicly available, author-annotated, privacy-compliant multilingual code-mixed dialogue corpus currently exists, which hinders computational and sociolinguistic research on informal settings (e.g., social media, instant messaging).
Method: We construct the first large-scale, context-aware, human-annotated, and ethically compliant general-purpose code-switching dialogue corpus. It covers major language pairs, including English–Mandarin, and comprises more than 355,000 authentic chat messages. The release uses multi-tier quality validation, a structured JSON format, fine-grained metadata, and comprehensive language statistics.
Contribution/Results: This resource fills a critical gap in informal multilingual interaction data. It provides an open-source, extensible benchmark with rich linguistic annotations, supporting robust code-switching modeling, cross-lingual understanding, and empirical sociolinguistic inquiry.
📝 Abstract
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.
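To illustrate how a JSON release with per-message metadata could feed the linguistic statistics mentioned above, here is a minimal sketch. The record layout (`id`, `tokens`, and per-token language tags such as `zh` and `en`) and the sample messages are hypothetical, invented for illustration only. They are not the published schema or data of the Codemix Corpus.

```python
import json
from collections import Counter

# Hypothetical records: the field names ("id", "tokens") and the language
# tags are illustrative assumptions, not the corpus's published schema.
SAMPLE = """
[
  {"id": 1, "tokens": [["我", "zh"], ["今天", "zh"], ["super", "en"], ["busy", "en"]]},
  {"id": 2, "tokens": [["ok", "en"], ["see", "en"], ["you", "en"]]},
  {"id": 3, "tokens": [["meeting", "en"], ["改", "zh"], ["到", "zh"], ["Friday", "en"]]}
]
"""


def switch_points(langs):
    """Count positions where the language tag differs from the previous token's."""
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


def corpus_stats(messages):
    """Return tokens per language, number of code-mixed messages, and total switch points."""
    lang_counts = Counter()
    mixed = 0
    switches = 0
    for msg in messages:
        langs = [lang for _tok, lang in msg["tokens"]]
        lang_counts.update(langs)
        if len(set(langs)) > 1:  # more than one language in a single message
            mixed += 1
        switches += switch_points(langs)
    return {
        "tokens_per_language": dict(lang_counts),
        "mixed_messages": mixed,
        "switch_points": switches,
    }


messages = json.loads(SAMPLE)
stats = corpus_stats(messages)
print(stats)
```

Counting switch points between adjacent tokens is only one simple way to quantify code-mixing. The statistics shipped with the actual dataset may be defined differently.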