Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

📅 2025-07-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) perform poorly on low-resource languages such as Tibetan, owing to severe data scarcity and the lack of specialized evaluation frameworks. Method: We construct the largest high-quality Tibetan pretraining corpus to date, systematically compiled from diverse sources and rigorously cleaned with language-specific heuristics, and propose a continual pretraining paradigm tailored to low-resource languages. Starting from a multilingual foundation model, we perform Tibetan-specific continual pretraining to develop Banzhida, a generative AI model, and establish the first comprehensive Tibetan benchmark for systematic model evaluation and optimization. Contribution/Results: Experiments demonstrate that Banzhida significantly outperforms both open-source models of comparable parameter count and existing Tibetan-specific models across multiple public and custom Tibetan tasks covering understanding, generation, and reasoning, marking, for the first time, a systematic breakthrough in both Tibetan comprehension and generation.
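The summary's "language-specific heuristics" are not spelled out here; as a rough illustration, the sketch below shows one plausible cleaning rule for a Tibetan web crawl: a script-ratio filter over the Tibetan Unicode block (U+0F00 to U+0FFF). The function names, threshold, and minimum length are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical Tibetan-specific cleaning heuristic: keep a crawled document
# only if it is long enough and mostly written in Tibetan script.
# Threshold values below are illustrative, not taken from the paper.

TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF  # Tibetan Unicode block


def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def keep_document(text: str, min_ratio: float = 0.7, min_chars: int = 200) -> bool:
    """Filter rule: discard short pages and pages dominated by other scripts."""
    return len(text) >= min_chars and tibetan_ratio(text) >= min_ratio
```

A real pipeline would combine several such rules (deduplication, boilerplate removal, encoding normalization) before any text reaches pre-training.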

📝 Abstract
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality Tibetan training corpora
Developing a multilingual model for the Tibetan language
Creating benchmarks to evaluate Tibetan language capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated the largest Tibetan pre-training corpus to date
Continued pre-training of a multilingual base model (a sketch of this step follows the list)
Created new Tibetan benchmarks for evaluation
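The continual pre-training step named above is, mechanically, ordinary causal language modeling on the curated Tibetan corpus starting from a multilingual checkpoint. Below is a minimal sketch using the Hugging Face transformers Trainer; the base checkpoint, data path, and hyperparameters are placeholder assumptions, not the paper's reported configuration.

```python
# Hedged sketch of continual pre-training on a cleaned Tibetan corpus.
# The checkpoint name, data path, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-7B"  # placeholder multilingual base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Cleaned Tibetan corpus, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="banzhida-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,  # low LR to limit forgetting of other languages
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False selects causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The abstract notes that post-training follows this stage; the sketch covers only the continual pre-training step.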
Leiyu Pan
Tianjin University
Natural Language Processing, Multilingual, Machine Translation
Bojian Xiong
TJUNLP Lab, Tianjin University
Lei Yang
TJUNLP Lab, Tianjin University
Renren Jin
College of Intelligence and Computing, Tianjin University
Natural Language Processing
Shaowei Zhang
TJUNLP Lab, Tianjin University
Yue Chen
TJUNLP Lab, Tianjin University
Ling Shi
TJUNLP Lab, Tianjin University
Jiang Zhou
TJUNLP Lab, Tianjin University
Junru Wu
TJUNLP Lab, Tianjin University
Zhen Wang
TJUNLP Lab, Tianjin University
Jianxiang Peng
Tianjin University
Natural Language Processing
Juesi Xiao
TJUNLP Lab, Tianjin University
Tianyu Dong
TJUNLP Lab, Tianjin University
Zhuowen Han
TJUNLP Lab, Tianjin University
Zhuo Chen
TJUNLP Lab, Tianjin University
Sangjee Dondrub
Qinghai Normal University
Caizang Tai
Qinghai Normal University
Haixing Zhao
Qinghai Normal University
Huaque Cairang
Qinghai Normal University
Suonan Cairang
Qinghai Normal University
Rou Te
Qinghai Normal University
Lengben Zhaxi
Qinghai Normal University
Gazang Zhaxi
Qinghai Normal University
Zhonglin Ye
Qinghai Normal University
Yuhui Zheng
Full Professor, School of Computer and Software, NUIST
Computer Vision, Multimedia Forensics, Digital Watermarking