CantoNLU: A Benchmark for Cantonese Natural Language Understanding

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Cantonese, a low-resource language spoken by millions, lacks a systematic natural language understanding (NLU) evaluation benchmark. To address this gap, we introduce CantoNLU—the first comprehensive Cantonese NLU benchmark—comprising seven tasks spanning syntactic and semantic levels. Methodologically, we construct and publicly release high-quality Cantonese datasets, and comparatively evaluate three model categories: (i) an off-the-shelf Mandarin LM, (ii) Cantonese-adapted models obtained via continued pretraining on Cantonese data, and (iii) a monolingual Cantonese model trained from scratch. Experimental results reveal three key findings: (1) Cantonese-adapted models achieve the best overall performance; (2) monolingual models outperform the others on syntactic tasks; and (3) Mandarin models remain competitive in low-data regimes. This work establishes a reproducible evaluation framework and provides open-source resources for dialectal NLP, advancing low-resource language technology.

📝 Abstract
Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of evaluation frameworks for Cantonese natural language understanding
Provides benchmark covering seven NLU tasks including syntax and semantics
Evaluates model performance across Mandarin, Cantonese-adapted, and monolingual Cantonese approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced CantoNLU benchmark for Cantonese NLU
Provided baseline models with Cantonese-adapted training
Released datasets and models for Cantonese research
Junghyun Min
Georgetown University
York Hay Ng
University of Toronto
Sophia Chan
Independent Researcher
Helena Shunhua Zhao
University of Toronto
En-Shiun Annie Lee
Ontario Tech University, and University of Toronto (Status-Only)
Natural Language Processing · Data Mining · Pattern Analysis