NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

📅 2025-10-23
🤖 AI Summary
To address the suboptimal performance of existing BERT models for Hebrew, a low-resource Semitic language, this work applies the NeoBERT architecture to Hebrew pretraining, the first to bring modern Transformer innovations (e.g., extended context windows and efficient attention mechanisms) to the language. The authors conduct large-scale monolingual pretraining on Hebrew text, followed by bilingual (Hebrew–English) alignment fine-tuning, yielding high-performance monolingual and bilingual BERT models. Experimental results show consistent, significant improvements over prior approaches across all major Hebrew benchmarks, including part-of-speech tagging, named entity recognition, dependency parsing, and question answering; notably, the bilingual model achieves substantial gains in cross-lingual retrieval. All models are publicly released, thereby bridging a critical gap in NLP resources for low-resource Semitic languages.

📝 Abstract
Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.
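
Since the abstract notes the models are released to the community, here is a minimal sketch of loading one of them for feature extraction with Hugging Face transformers. The model ID "dicta-il/neodictabert" and the trust_remote_code flag are assumptions (NeoBERT-style checkpoints typically ship custom modeling code); check the official release for the exact identifier.

```python
# Minimal sketch: extracting contextual embeddings with NeoDictaBERT.
# The model ID below is an assumption; verify it against the official release.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "dicta-il/neodictabert"  # assumed identifier

# NeoBERT-style models typically ship custom modeling code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

text = "שלום עולם"  # "Hello world" in Hebrew
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch, seq_len, hidden_dim)
print(outputs.last_hidden_state.shape)
```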
Problem

Research questions and friction points this paper is trying to address.

Modernizing the BERT architecture for Hebrew language processing
Improving performance on Hebrew NLP benchmarks with NeoBERT-based models
Enhancing bilingual retrieval performance for Hebrew texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted the NeoBERT architecture to Hebrew
Stronger BERT-style models for Hebrew NLP tasks
Bilingual model excels in retrieval performance (see the sketch below)
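
To make the retrieval point concrete, below is a hedged sketch of Hebrew–English dense retrieval with the bilingual model: mean-pooled token embeddings compared by cosine similarity. The model ID and the pooling strategy are illustrative assumptions, not necessarily the paper's method.

```python
# Hedged sketch of Hebrew–English retrieval with NeoDictaBERT-bilingual.
# Model ID and mean pooling are assumptions for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "dicta-il/neodictabert-bilingual"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def embed(texts):
    """Mean-pool token embeddings over non-padding positions, then L2-normalize."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # (B, H)
    return F.normalize(pooled, dim=-1)

query = embed(["What is the capital of Israel?"])
docs = embed(["ירושלים היא בירת ישראל",   # "Jerusalem is the capital of Israel"
              "תל אביב שוכנת לחוף הים"])  # "Tel Aviv sits on the seashore"
print(query @ docs.T)  # cosine similarities; higher = better match
```

With normalized embeddings, the dot product equals cosine similarity, so ranking documents reduces to a single matrix multiply.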
Shaltiel Shmidman
DICTA / Jerusalem, Israel
Avi Shmidman
Bar-Ilan University / Ramat Gan, Israel
Moshe Koppel
Bar-Ilan University / Ramat Gan, Israel