llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Encoder models like BERT remain underexplored for large-scale pretraining on extensive corpora and long contexts, particularly for Japanese. Method: We train llm-jp-modernbert, a ModernBERT encoder model for Japanese, pretrained on a massive publicly available Japanese corpus with an 8,192-token context length, and systematically investigate how the extended context affects masked language modeling performance and how sentence embeddings evolve during training. Results: While the model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask evaluations; pseudo-perplexity experiments quantify the effect of context length expansion, and the sentence embedding transitions match those of other models sharing the same architecture. The model weights, training code, and evaluation code are publicly released to support reproducibility and foster long-context BERT research.

📝 Abstract
Encoder-only transformer models like BERT are widely adopted as a pre-trained backbone for tasks like sentence classification and retrieval. However, pretraining of encoder models with large-scale corpora and long contexts has been relatively underexplored compared to decoder-only transformers. In this work, we present llm-jp-modernbert, a ModernBERT model trained on a publicly available, massive Japanese corpus with a context length of 8192 tokens. While our model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask test evaluations. We also analyze the effect of context length expansion through pseudo-perplexity experiments. Furthermore, we investigate sentence embeddings in detail, analyzing their transitions during training and comparing them with those from other existing models, confirming similar trends with models sharing the same architecture. To support reproducibility and foster the development of long-context BERT, we release our model, along with the training and evaluation code.
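The pseudo-perplexity analysis mentioned in the abstract masks one token at a time and scores the gold token at each masked position; the per-position log-probabilities are then averaged and exponentiated. A minimal sketch of that aggregation step, assuming the per-token masked log-probabilities have already been collected from an encoder (the scorer itself is not shown here):

```python
import math

def pseudo_perplexity(token_log_probs):
    """Pseudo-perplexity from per-token masked log-probabilities.

    Each value is log P(token_i | sentence with token_i masked),
    obtained by masking one position at a time and reading the
    encoder's prediction for the gold token at that position.
    Lower is better, as with ordinary perplexity.
    """
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy illustration with made-up log-probabilities for a 4-token sentence:
scores = [-0.1, -0.5, -0.2, -0.3]
print(round(pseudo_perplexity(scores), 4))  # → 1.3165
```

Comparing this quantity on long inputs before and after context-length extension is one way to measure whether the extended context actually helps the model.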
Problem

Research questions and friction points this paper is trying to address.

Exploring large-scale pretraining for encoder models like BERT
Extending context length in BERT models for Japanese text
Analyzing sentence embeddings and training dynamics in ModernBERT
Innovation

Methods, ideas, or system contributions that make the work stand out.

ModernBERT trained on large Japanese corpus
Long context length of 8192 tokens
Released model and code for reproducibility
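A fill-mask test of the kind reported above typically checks whether the gold token appears among the model's top-k predictions for each masked position. A hedged sketch of that metric, with hypothetical prediction lists standing in for real model output (this is illustrative, not the paper's released evaluation code):

```python
def topk_fill_mask_accuracy(predictions, golds, k=5):
    """Fraction of masked positions whose gold token appears in the
    model's top-k predicted fillers (a common fill-mask metric)."""
    hits = sum(1 for preds, gold in zip(predictions, golds)
               if gold in preds[:k])
    return hits / len(golds)

# Hypothetical top-5 predictions for three masked positions:
preds = [["東京", "大阪", "京都", "名古屋", "札幌"],
         ["犬", "猫", "鳥", "魚", "馬"],
         ["走る", "歩く", "跳ぶ", "泳ぐ", "飛ぶ"]]
golds = ["京都", "兎", "歩く"]
print(topk_fill_mask_accuracy(preds, golds))  # → 0.6666666666666666
```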