🤖 AI Summary
This work addresses the scarcity of high-quality open-source large language models (LLMs) for low-resource languages like Hebrew, which hinders localized applications. We present the first systematic effort to develop a family of Hebrew LLMs at three scales—24B, 12B, and 1.7B parameters—based on Mistral-Small-3.1, NVIDIA Nemotron Nano V2, and Qwen3-1.7B, respectively. These models are adapted using large-scale Hebrew–English mixed corpora, support 65K-token context lengths and tool calling, and are released in both base and chat variants. Additionally, we introduce the first comprehensive evaluation benchmark for Hebrew chat models, covering a diverse set of tasks including translation, summarization, Winograd schema resolution, Israeli trivia question answering, and nikud (vowel diacritic) restoration. The proposed framework is readily generalizable to other non-English languages and significantly advances Hebrew natural language processing.
📝 Abstract
Open-weight LLMs have been released by frontier labs; however, sovereign LLMs — models for languages other than English — remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantial corpora of Hebrew and English text. The models are released in three sizes: 24B, adapted from the Mistral-Small-3.1 base model; 12B, adapted from the NVIDIA Nemotron Nano V2 model; and 1.7B, adapted from the Qwen3-1.7B base model. We release multiple variants of each model, each with a native context length of 65K tokens: a base model and a chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluating Hebrew chat LLMs, covering a diverse set of tasks including translation, summarization, Winograd schema resolution, Israeli trivia, and diacritization (nikud). Our work not only addresses the intricacies of training LLMs for low-resource languages but also proposes a framework that can be leveraged to adapt other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.