FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Tibetan, a low-resource language, has long lacked standardized infrastructure for vision-language research. This work presents FTibSuite, which the authors describe as the first comprehensive multimodal resource suite for Tibetan. It comprises three parts: FTibData, a human-verified image-text training corpus; FTibBench, Tibetan adaptations of five mainstream multimodal benchmarks (including MMBench), built with a hierarchical quality-control workflow to reduce translation-induced noise; and FTibVLM, a baseline adapted from Qwen3-VL-8B-Instruct in three stages: continual pretraining, image-text alignment, and instruction tuning. On FTibBench, FTibVLM improves MMBench accuracy from 42.97% to 67.78% and POPE-random accuracy from 47.53% to 80.56%, while largely preserving the backbone's original Chinese capabilities. The result is a reproducible, end-to-end toolkit for Tibetan multimodal research.

📝 Abstract

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

Problem

Research questions and friction points this paper is trying to address.

Tibetan

vision-language modeling

low-resource language

multimodal benchmarks

training infrastructure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tibetan vision-language modeling

low-resource language

multimodal benchmark