🤖 AI Summary
Low-resource languages like Tibetan lack Chain-of-Thought (CoT) reasoning data, hindering the development of reasoning-capable AI systems. Method: This work introduces TIBSTC-CoT—the first large-scale, multi-domain Tibetan CoT instruction dataset—and proposes a scalable, reproducible framework for CoT data construction in low-resource languages. It leverages large language models to automatically synthesize high-quality Tibetan CoT samples and trains the Tibetan-specialized Sunshine-thinking series via full supervised fine-tuning. Contribution/Results: This is the first systematic integration of CoT reasoning into Tibetan AI, substantially improving mathematical reasoning, logical inference, and complex generation capabilities. On multiple Tibetan benchmarks, Sunshine-thinking matches or exceeds state-of-the-art multilingual models (e.g., Qwen2, Llama3). Both the dataset and models are fully open-sourced, advancing inclusive, multilingual artificial intelligence.
📝 Abstract
To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the first large-scale, multi-domain Tibetan chain-of-thought dataset, automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking demonstrates strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at: https://github.com/Vicentvankor/sun-shine.