TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

📅 2025-08-03
🤖 AI Summary
Problem: Low-resource languages like Tibetan lack Chain-of-Thought (CoT) reasoning data, hindering the development of reasoning-capable AI systems. Method: This work introduces TIBSTC-CoT, the first large-scale, multi-domain Tibetan CoT instruction dataset, and proposes a scalable, reproducible framework for CoT data construction in low-resource languages. It leverages large language models to automatically synthesize high-quality Tibetan CoT samples and trains the Tibetan-specialized Sunshine-thinking series via full supervised fine-tuning. Contribution/Results: This is the first systematic integration of CoT reasoning into Tibetan AI, substantially improving mathematical reasoning, logical inference, and complex generation capabilities. On multiple Tibetan benchmarks, Sunshine-thinking matches or exceeds state-of-the-art multilingual models (e.g., Qwen2, Llama3). Both the dataset and models are fully open-sourced, advancing inclusive, multilingual artificial intelligence.
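The summary does not spell out the synthesis pipeline, so the following is only a minimal sketch of what LLM-driven CoT data synthesis for a low-resource language could look like, assuming an OpenAI-style chat API. The model name, prompt wording, domain list, and the `synthesize_cot_sample` helper are illustrative assumptions, not the authors' actual method.

```python
# Illustrative sketch only: the paper's real synthesis pipeline is not
# described here. Model name, prompts, and domains are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the authors may use a different LLM

DOMAINS = ["math", "logic", "science", "culture"]  # hypothetical domain list

SYSTEM_PROMPT = (
    "You are a bilingual assistant fluent in Tibetan. Given a seed instruction, "
    "produce a Tibetan instruction, a step-by-step Tibetan chain of thought, "
    "and a final Tibetan answer. Respond in JSON with keys "
    "'instruction', 'chain_of_thought', and 'answer'."
)

def synthesize_cot_sample(seed_instruction: str, domain: str) -> dict:
    """Ask the LLM for one Tibetan CoT sample (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Domain: {domain}\nSeed: {seed_instruction}"},
        ],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    sample = synthesize_cot_sample("Solve: 12 + 7 * 3", "math")
    print(sample["chain_of_thought"])
```

A production pipeline would also need quality filtering and deduplication before samples enter the dataset; those steps are omitted from this sketch.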

📝 Abstract
To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the first large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: https://github.com/Vicentvankor/sun-shine.
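The abstract states that Sunshine-thinking is trained entirely on TIBSTC-CoT via full supervised fine-tuning. Below is a minimal Hugging Face-style SFT sketch under that assumption; the base checkpoint, data file, field names, and hyperparameters are placeholders, not the paper's actual configuration.

```python
# Minimal SFT sketch, assuming a Hugging Face setup; base model, field names,
# and hyperparameters are illustrative, not the paper's configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "Qwen/Qwen2-7B"  # placeholder; the paper compares against Qwen2/Llama3

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical JSONL with 'instruction', 'chain_of_thought', 'answer' fields.
dataset = load_dataset("json", data_files="tibstc_cot.jsonl", split="train")

def format_and_tokenize(example):
    # Concatenate instruction, reasoning, and answer into one training sequence.
    text = (f"Instruction: {example['instruction']}\n"
            f"Reasoning: {example['chain_of_thought']}\n"
            f"Answer: {example['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sunshine-thinking-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (inputs shifted internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```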
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in Tibetan language processing
Creating a scalable dataset for low-resource language reasoning
Developing Tibetan-centric LLMs with chain-of-thought capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Tibetan dataset creation via LLMs
Scalable framework for low-resource languages
Tibetan-centric LLMs with chain-of-thought reasoning
👥 Authors
Fan Gao (Caltech; MIT): NGS Bioinformatics, Image data processing, AI/ML, Neurodegeneration, Protein Bioinformatics
Cheng Huang (University of Electronic Science and Technology of China)
Nyima Tashi (Tibet University)
Yutong Liu (University of Electronic Science and Technology of China)
Xiangxiang Wang (University of Electronic Science and Technology of China): neural networks, time scales, nonlinear systems, impulsive control
Thupten Tsering (Tibet University)
Ban Ma-bao (Tibet University)
Renzeg Duojie (Tibet University)
Gadeng Luosang (Sichuan University, Tibet University): Multilingual natural language processing, medical image processing
Rinchen Dongrub (Tibet University)
Dorje Tashi (Tibet University)
Xiao Feng (University of Electronic Science and Technology of China)
Hao Wang (University of Electronic Science and Technology of China)
Yongbin Yu (University of Electronic Science and Technology of China): Memristor, Neural Network, Natural Language Processing, Impulsive Control, Swarm Intelligence, EDA, MBSE