🤖 AI Summary
Tibetan, a low-resource language, has long lacked standardized infrastructure for vision–language research. This work presents FTibSuite, described as the first comprehensive multimodal research suite for Tibetan. It comprises FTibData, a human-verified image–text training corpus; FTibBench, Tibetan adaptations of five mainstream multimodal benchmarks (including MMBench) built with a hierarchical quality-control workflow to reduce translation-induced noise; and FTibVLM, a reproducible baseline. FTibVLM adapts Qwen3-VL-8B-Instruct in three stages: continual pretraining, image–text alignment, and instruction tuning. On FTibBench, MMBench accuracy improves from 42.97% to 67.78% and POPE-random accuracy from 47.53% to 80.56%, while the backbone's original Chinese capabilities are retained with minimal degradation. The study delivers a reproducible, end-to-end toolkit for Tibetan multimodal research.
📝 Abstract
Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language because it lacks reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research. It consists of three components: FTibData, human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning; FTibBench, Tibetan adaptations of five mainstream multimodal benchmarks, built with a hierarchical quality-control workflow to reduce translation noise; and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show that FTibVLM delivers consistent gains across all tasks. For example, MMBench accuracy improves from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while the backbone's original Chinese capabilities are retained with minimal degradation. Together, these resources provide the first standardized foundation for Tibetan multimodal research.
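To put the two reported results on a common footing, the short sketch below converts them into absolute gains (percentage points) and relative gains. It assumes the scores are accuracies on a 0 to 100 scale, as the summary's percent signs suggest. The names `REPORTED` and `gains` are illustrative only and are not part of FTibSuite.

```python
# Scores quoted in the abstract, as (backbone, FTibVLM) pairs.
# Assumption: both are accuracy percentages on a 0-100 scale.
REPORTED = {
    "MMBench": (42.97, 67.78),
    "POPE-random": (47.53, 80.56),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


for name, (before, after) in REPORTED.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

The absolute gain is the more conservative figure to quote. The relative gain depends heavily on how low the starting score was, which is worth keeping in mind when comparing the two benchmarks.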