Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

📅 2025-07-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) perform poorly on low-resource languages such as Tibetan, owing to severe data scarcity and the lack of specialized evaluation frameworks. Method: We construct the largest high-quality Tibetan pretraining corpus to date, systematically compiled from diverse sources and rigorously cleaned with language-specific heuristics, and propose a continual pretraining paradigm tailored to low-resource languages. Starting from a multilingual foundation model, we perform Tibetan-specific continual pretraining to develop Banzhida, a generative AI model, and establish the first comprehensive Tibetan benchmark for systematic model evaluation and optimization. Contribution/Results: Experiments demonstrate that Banzhida significantly outperforms both open-source models of comparable parameter count and existing Tibetan-specific models across multiple public and custom Tibetan tasks covering understanding, generation, and reasoning, marking, for the first time, a systematic breakthrough in both Tibetan comprehension and generation.
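The summary's "language-specific heuristics" are not spelled out here; as a rough illustration, the sketch below shows one plausible cleaning rule for a Tibetan web crawl: a script-ratio filter over the Tibetan Unicode block (U+0F00 to U+0FFF). The function names, threshold, and minimum length are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical Tibetan-specific cleaning heuristic: keep a crawled document
# only if it is long enough and mostly written in Tibetan script.
# Threshold values below are illustrative, not taken from the paper.

TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF  # Tibetan Unicode block


def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def keep_document(text: str, min_ratio: float = 0.7, min_chars: int = 200) -> bool:
    """Filter rule: discard short pages and pages dominated by other scripts."""
    return len(text) >= min_chars and tibetan_ratio(text) >= min_ratio
```

A real pipeline would combine several such rules (deduplication, boilerplate removal, encoding normalization) before any text reaches pre-training.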

📝 Abstract
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality Tibetan training corpora
Developing a multilingual model for the Tibetan language
Creating benchmarks to evaluate Tibetan language capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated the largest Tibetan pre-training corpus to date
Continued pre-training of a multilingual base model (a sketch of this step follows the list)
Created new Tibetan benchmarks for evaluation
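The continual pre-training step named above is, mechanically, ordinary causal language modeling on the curated Tibetan corpus starting from a multilingual checkpoint. Below is a minimal sketch using the Hugging Face transformers Trainer; the base checkpoint, data path, and hyperparameters are placeholder assumptions, not the paper's reported configuration.

```python
# Hedged sketch of continual pre-training on a cleaned Tibetan corpus.
# The checkpoint name, data path, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-7B"  # placeholder multilingual base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Cleaned Tibetan corpus, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="banzhida-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,  # low LR to limit forgetting of other languages
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False selects causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The abstract notes that post-training follows this stage; the sketch covers only the continual pre-training step.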
Leiyu Pan
Tianjin University
Natural Language Processing, Multilingual, Machine Translation
Bojian Xiong
TJUNLP Lab, Tianjin University
Lei Yang
TJUNLP Lab, Tianjin University
Renren Jin
College of Intelligence and Computing, Tianjin University
Natural Language Processing
Shaowei Zhang
TJUNLP Lab, Tianjin University
Yue Chen
TJUNLP Lab, Tianjin University
Ling Shi
TJUNLP Lab, Tianjin University
Jiang Zhou
TJUNLP Lab, Tianjin University
Junru Wu
TJUNLP Lab, Tianjin University
Zhen Wang
TJUNLP Lab, Tianjin University
Jianxiang Peng
Tianjin University
Natural Language Processing
Juesi Xiao
TJUNLP Lab, Tianjin University
Tianyu Dong
TJUNLP Lab, Tianjin University
Zhuowen Han
TJUNLP Lab, Tianjin University
Zhuo Chen
TJUNLP Lab, Tianjin University
Sangjee Dondrub
Qinghai Normal University
Caizang Tai
Qinghai Normal University
Haixing Zhao
Qinghai Normal University
Huaque Cairang
Qinghai Normal University
Suonan Cairang
Qinghai Normal University
Rou Te
Qinghai Normal University
Lengben Zhaxi
Qinghai Normal University
Gazang Zhaxi
Qinghai Normal University
Zhonglin Ye
Qinghai Normal University
Yuhui Zheng
Full Professor, School of Computer and Software, NUIST
Computer Vision, Multimedia Forensics, Digital Watermarking