Sun-Shine: A Large Language Model for Tibetan Culture

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of weak cultural adaptation, scarcity of high-quality data, and complex morphology hindering model performance in Tibetan NLP, this paper introduces Sun-Shine—the first large language model explicitly designed for Tibetan culture. Methodologically: (1) we propose a Llama-derived architecture tailored to Tibetan morphological features; (2) we construct TIB-STC, a multilingual, high-quality Tibetan instruction dataset spanning religious texts, literature, news, and dialogues; and (3) we employ instruction fine-tuning coupled with multi-task joint training to enable fine-grained cultural knowledge embedding and strong generalization under low-resource conditions. Experiments demonstrate that Sun-Shine significantly outperforms existing baselines across Tibetan language modeling, machine translation, and syntactic parsing. It maintains high accuracy and robustness in low-resource settings and exhibits preliminary capabilities toward embodied intelligence.
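The summary notes that Sun-Shine is trained with instruction fine-tuning over TIB-STC. The paper's actual prompt template is not reproduced here, but as a rough illustration, an Alpaca-style instruction-formatting helper (all field names and the template layout are hypothetical, not taken from the paper) could look like:

```python
def format_instruction_example(instruction: str, input_text: str, response: str) -> str:
    """Render one (instruction, input, response) triple into a single
    training string, in the common Alpaca-style layout. This is an
    illustrative sketch only; Sun-Shine's real template may differ."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:  # the input field is optional for open-ended tasks
        parts.append(f"### Input:\n{input_text}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


# Hypothetical translation example (Tibetan greeting "tashi delek"):
example = format_instruction_example(
    instruction="Translate the following Tibetan text into English.",
    input_text="བཀྲ་ཤིས་བདེ་ལེགས།",
    response="Greetings and good fortune.",
)
print(example)
```

During supervised fine-tuning, strings like this would be tokenized and the loss typically masked so that only the response tokens contribute to the gradient; whether Sun-Shine uses this exact scheme is not stated in the summary.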

📝 Abstract
Tibetan, a minority language in China, features a highly intricate grammatical structure characterized by four verb tenses and frequent irregular inflections, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite this success, current LLMs often fall short of serving Tibetan speakers and domain experts, and the potential of LLMs for Tibetan culture remains under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture, as well as the necessity for higher granularity and richness of knowledge. At the same time, the complexity and uniqueness of Tibetan's grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which excels at various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, such as language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in Tibetan language processing
Optimizing LLMs for Tibetan's complex grammatical structure
Enhancing cultural knowledge granularity for Tibetan domain experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

First LLM for Tibetan culture
Optimized architecture for Tibetan language
Comprehensive Tibetan dataset TIB-STC
Cheng Huang
University of Texas Southwestern Medical Center
Fan Gao
Caltech; MIT
NGS Bioinformatics, Image data processing, AI/ML, Neurodegeneration, Protein Bioinformatics
Nyima Tashi
Tibet University
Yutong Liu
University of Electronic Science and Technology of China
Xiangxiang Wang
University of Electronic Science and Technology of China
neural networks, time scales, nonlinear systems, impulsive control
Thupten Tsering
Tibet University
Ma-bao Ban
Tibet University
Renzeg Duojie
Tibet University
Gadeng Luosang
Sichuan University, Tibet University
Multilingual natural language processing, medical image processing
Rinchen Dongrub
Tibet University
Dorje Tashi
Tibet University
Xiao Feng
University of Electronic Science and Technology of China
Yongbin Yu
University of Electronic Science and Technology of China
Memristor, Neural Network, Natural Language Processing, Impulsive Control, Swarm Intelligence, EDA, MBSE