Chinese ModernBERT with Whole-Word Masking

📅 2025-10-14

🤖 AI Summary

Advances in encoder-only Transformers have not fully carried over to Chinese, whose tokenization and morphology differ markedly from English. Chinese ModernBERT is a Chinese encoder trained from scratch that combines: (i) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes and compounds; (ii) whole-word masking with a dynamic masking curriculum (30% -> 15%); (iii) two-stage pre-training that extends the native context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (iv) a damped-cosine learning-rate schedule. The model is pre-trained on ~1.2T Chinese tokens. On CLUE, it is competitive with strong Chinese encoders under a unified fine-tuning protocol, and under bf16 it achieves high long-sequence throughput while keeping strong short-sequence speed. After contrastive fine-tuning on SimCLUE and T2Ranking, it reaches 0.505 (Pearson) / 0.537 (Spearman) on the SimCLUE test set and surpasses Qwen-0.6B-embedding on SimCLUE in this open-data setting.

📝 Abstract

Encoder-only Transformers have advanced along three axes -- architecture, data, and systems -- yielding Pareto gains in accuracy, speed, and memory efficiency. Yet these improvements have not fully transferred to Chinese, where tokenization and morphology differ markedly from English. We introduce Chinese ModernBERT, a from-scratch Chinese encoder that couples: (i) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes/compounds, lowering the embedding budget; (ii) whole-word masking (WWM) with a dynamic masking curriculum (30% -> 15%) to align task difficulty with training progress; (iii) a two-stage pre-training pipeline that extends the native context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (iv) a damped-cosine learning-rate schedule for stable long-horizon optimization. We pre-train on ~1.2T Chinese tokens from CCI3-HQ, CCI4 (Chinese), and Cosmopedia-Chinese. On CLUE, Chinese ModernBERT is competitive with strong Chinese encoders under a unified fine-tuning protocol. Under bf16 it achieves high long-sequence throughput while maintaining strong short-sequence speed, reflecting benefits from budget allocation and attention design. To probe retrieval-oriented quality, we add a small amount of open contrastive data: fine-tuning on SimCLUE (~3M pairs) improves further when adding T2Ranking (~2M), reaching 0.505 (Pearson) / 0.537 (Spearman) on the SimCLUE test set. Under this open-data setting, Chinese ModernBERT surpasses Qwen-0.6B-embedding on SimCLUE, suggesting a clear scaling path for STS with additional curated pairs. We will release tokenizer and weights to facilitate reproducible research.
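The dynamic masking curriculum (30% -> 15%) can be pictured as a schedule that maps training progress to a masking rate. The abstract gives only the two endpoints, so the linear interpolation below is an assumption, and the function name `masking_rate` and its parameters are hypothetical. It is a minimal sketch, not the authors' implementation.

```python
def masking_rate(step: int, total_steps: int,
                 start: float = 0.30, end: float = 0.15) -> float:
    """Return the masking rate for a given training step.

    Assumption: the rate decays linearly from `start` to `end`.
    The source states only the endpoints (30% -> 15%), not the shape.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay at an endpoint.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress
```

A data collator could call such a function once per batch, so that early batches mask more aggressively than later ones.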

Problem

Research questions and friction points this paper is trying to address.

Building a Chinese encoder that accounts for how Chinese tokenization and morphology differ from English

Applying whole-word masking with a dynamic masking curriculum to Chinese pre-training

Extending the context length and choosing an attention design for efficient Chinese language processing
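One way to picture alternating local/global attention is as a per-layer choice between a full attention mask and a sliding-window mask. The sketch below illustrates that idea only. The window size, the "every third layer is global" pattern, and the names `layer_is_global` and `attention_mask` are hypothetical, since the summary does not state these details.

```python
from typing import List


def layer_is_global(layer: int, global_every: int = 3) -> bool:
    """Assumed pattern: every `global_every`-th layer attends globally."""
    return layer % global_every == 0


def attention_mask(seq_len: int, layer: int,
                   window: int = 4, global_every: int = 3) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if token i may attend to j."""
    if layer_is_global(layer, global_every):
        # Global layer: every token sees every other token.
        return [[True] * seq_len for _ in range(seq_len)]
    # Local layer: each token sees neighbours within half a window on each side.
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
```

In local layers the number of attended positions per token is bounded by the window, which is what keeps long sequences affordable. The global layers let information travel across the whole context.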

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes and compounds

Whole-word masking with a dynamic masking curriculum (30% -> 15%)

Two-stage pre-training that extends the native context from 1,024 to 8,192 tokens
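Whole-word masking means that when a word is selected, all of its tokens are masked together rather than independently. The sketch below assumes the input is already segmented into words, each given as a list of tokens. The segmentation step, the `[MASK]` string, and the function name `whole_word_mask` are assumptions for illustration, not details from the paper.

```python
import random
from typing import List

MASK = "[MASK]"  # placeholder mask token, assumed for illustration


def whole_word_mask(words: List[List[str]], rate: float,
                    rng: random.Random) -> List[str]:
    """Mask whole words until roughly `rate` of all tokens are masked."""
    total = sum(len(w) for w in words)
    budget = max(1, round(total * rate)) if total else 0
    order = list(range(len(words)))
    rng.shuffle(order)  # visit words in random order
    chosen = set()
    masked = 0
    for idx in order:
        if masked >= budget:
            break
        chosen.add(idx)
        masked += len(words[idx])  # may overshoot the budget by part of a word
    out: List[str] = []
    for idx, word in enumerate(words):
        # A chosen word is masked in full; other words pass through unchanged.
        out.extend([MASK] * len(word) if idx in chosen else word)
    return out
```

Calling it with words such as `[["北", "京"], ["大", "学"], ["很"], ["好"]]` and a rate of 0.3 masks at least one complete word and never masks only half of "北京".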
