FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tibetan is a low-resource language that has long lacked standardized infrastructure for vision-language research. This work presents FTibSuite, a multimodal research suite for Tibetan with three parts: FTibData, a human-verified multimodal training corpus; FTibBench, Tibetan adaptations of five mainstream multimodal benchmarks (including MMBench), built with a hierarchical quality-control workflow to reduce translation noise; and FTibVLM, a baseline adapted from Qwen3-VL-8B-Instruct in three stages (continual pretraining, image-text alignment, and instruction tuning). On FTibBench, FTibVLM improves MMBench accuracy from 42.97% to 67.78% and POPE-random accuracy from 47.53% to 80.56%, while largely preserving the backbone's original Chinese capabilities. The suite provides a reproducible foundation for Tibetan multimodal research.
📝 Abstract
Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
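To put the reported improvements in perspective, the following minimal sketch computes the absolute and relative gains from the four scores quoted in the abstract. It assumes the scores are accuracies on a 0-100 scale, so that differences are percentage points; the names `SCORES` and `gain` are hypothetical and not part of FTibSuite.

```python
# Scores quoted in the abstract: (backbone baseline, FTibVLM).
# Assumption: both are accuracies on a 0-100 scale.
SCORES = {
    "MMBench": (42.97, 67.78),
    "POPE-random": (47.53, 80.56),
}


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for name, (before, after) in SCORES.items():
        absolute, relative = gain(before, after)
        print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

Under that assumption, the MMBench gain is about 24.8 points and the POPE-random gain about 33.0 points.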
Problem

Research questions and friction points this paper is trying to address.

Tibetan
vision-language modeling
low-resource language
multimodal benchmarks
training infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tibetan vision-language modeling
low-resource language
multimodal benchmark
three-stage adaptation pipeline
reproducible baseline