🤖 AI Summary
Low-resource languages like Tibetan lack Chain-of-Thought (CoT) reasoning data, hindering the development of reasoning-capable AI systems. Method: This work introduces TIBSTC-CoT—the first large-scale, multi-domain Tibetan CoT instruction dataset—and proposes a scalable, reproducible framework for CoT data construction in low-resource languages. It leverages large language models to automatically synthesize high-quality Tibetan CoT samples and trains the Tibetan-specialized Sunshine-thinking series via full supervised fine-tuning. Contribution/Results: This is the first systematic integration of CoT reasoning into Tibetan AI, substantially improving mathematical reasoning, logical inference, and complex generation capabilities. On multiple Tibetan benchmarks, Sunshine-thinking matches or exceeds state-of-the-art multilingual models (e.g., Qwen2, Llama3). Both the dataset and models are fully open-sourced, advancing inclusive, multilingual artificial intelligence.
📝 Abstract
To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the first large-scale, multi-domain Tibetan chain-of-thought dataset, automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking demonstrates strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at: https://github.com/Vicentvankor/sun-shine.