AI Summary
Large language models (LLMs) perform poorly on low-resource languages such as Tibetan due to severe data scarcity and the lack of specialized evaluation frameworks. Method: We construct the largest high-quality Tibetan pretraining corpus to date, systematically compiled from diverse sources and rigorously cleaned with language-specific heuristics, and propose a continual pretraining paradigm tailored to low-resource languages. Starting from a multilingual foundation model, we perform Tibetan-specific continual pretraining to develop Banzhida, a generative AI model, and establish the first comprehensive Tibetan benchmark for systematic model evaluation and optimization. Contribution/Results: Experiments demonstrate that Banzhida significantly outperforms open-source models of comparable parameter count and existing Tibetan-specific models across multiple public and custom Tibetan tasks, including understanding, generation, and reasoning, achieving for the first time systematic advances in both Tibetan comprehension and generation.
Abstract
Large language models have achieved remarkable progress across many languages. However, Tibetan, a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored to Tibetan. With the curated data, we continually pre-train and post-train a multilingual base model to obtain Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the model's Tibetan capabilities, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
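For intuition only, the sketch below shows one kind of language-specific heuristic such a cleaning pipeline could include: a script-ratio filter that keeps documents dominated by Tibetan Unicode characters. The function, thresholds, and exact rule are illustrative assumptions, not the pipeline described in the paper.

```python
import re

# Tibetan script occupies the Unicode block U+0F00-U+0FFF.
TIBETAN_CHAR = re.compile(r"[\u0F00-\u0FFF]")

def keep_document(text: str, min_tibetan_ratio: float = 0.6, min_chars: int = 50) -> bool:
    """Hypothetical filter: keep a document only if it is long enough
    and mostly written in Tibetan script (whitespace ignored)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < min_chars:
        return False
    tibetan = sum(1 for c in chars if TIBETAN_CHAR.match(c))
    return tibetan / len(chars) >= min_tibetan_ratio
```

In practice, a filter like this would be combined with deduplication and quality scoring; the ratio threshold trades recall of code-mixed text against corpus purity.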