Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

📅 2026-01-27
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the prevalent text-dominant bias in existing vision-language models (VLMs), where visual signals are treated merely as passive inputs, leading to the loss of fine-grained visual detail and only coarse-grained multimodal understanding. To overcome this limitation, we propose Youtu-VL, a novel framework that introduces the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. VLUAS unifies visual and linguistic tokens into a single autoregressive prediction sequence, enabling visual tokens to serve as prediction targets rather than just contextual inputs. This approach breaks away from conventional text-centric training paradigms and supports a wide range of vision-centric tasks without task-specific customization. Extensive experiments demonstrate that Youtu-VL achieves competitive performance on both general multimodal benchmarks and vision-intensive tasks, significantly enhancing visual detail preservation and joint multimodal modeling capabilities.
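To illustrate the "single autoregressive prediction sequence" idea, here is a minimal, hypothetical sketch of how visual and text tokens could be interleaved into one target stream. It assumes images are quantized into discrete token ids sharing a vocabulary with text; the function and delimiter names are illustrative, not taken from the Youtu-VL release.

```python
import torch

def build_unified_sequence(image_token_ids: torch.Tensor,
                           text_token_ids: torch.Tensor,
                           boi_id: int, eoi_id: int) -> torch.Tensor:
    """Concatenate <boi> [visual tokens] <eoi> [text tokens] into one stream.

    Every position in the returned 1-D sequence, visual or linguistic,
    can then be supervised with the same next-token objective, which is
    what makes visual tokens prediction targets rather than mere context.
    """
    boi = torch.tensor([boi_id], dtype=image_token_ids.dtype)
    eoi = torch.tensor([eoi_id], dtype=image_token_ids.dtype)
    return torch.cat([boi, image_token_ids, eoi, text_token_ids])
```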

📝 Abstract
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
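To make the "vision-as-target" shift concrete, below is a minimal PyTorch sketch of a unified next-token loss over such an interleaved sequence. It assumes the setup from the sketch above (discrete visual ids in a shared vocabulary); shapes and names are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def vluas_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over an interleaved vision+text sequence.

    logits:  (batch, seq_len, vocab) from a decoder-only VLM.
    targets: (batch, seq_len) interleaved visual and text token ids.
    Unlike the conventional text-centric recipe, visual positions are
    NOT masked out, so the model is supervised on visual tokens too.
    """
    # Shift so position t predicts token t+1, as in standard LM training.
    logits = logits[:, :-1, :].contiguous()
    targets = targets[:, 1:].contiguous()
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1))
```

For contrast, conventional text-dominant training would first set visual positions to an ignore index (e.g., targets[vision_mask] = -100) so the loss skips them; the unified objective keeps them as live prediction targets.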
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
fine-grained visual information
text-dominant bias
multimodal comprehension
visual supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Unified Autoregressive Supervision
vision-as-target
visual tokens
multimodal comprehension
generalist visual agents
Authors

Zhixiang Wei
Yi Li
Zhehan Kan (PhD student, Tsinghua University): CV, MLLMs, LLMs
Xinghua Jiang (Tencent Youtu Lab)
Zuwei Long
Shifeng Liu
Hongze Shen
Wei Liu
Xiaoyu Tan
Haojia Lin (Tencent Youtu Lab)
Yubo Zhu
Qianyu Li
Di Yin (Tencent): LLM, NLP, MLLM
Haoyu Cao
Weibo Gu
Xin Li
Yinsong Liu
Deqiang Jiang (Tencent Youtu Lab)
Xing Sun (Tencent Youtu Lab): LLM, MLLM, Agent
Yunsheng Wu
Mingkong Tang
Shuangyin Liu
Le Tang
Haodong Lin
Junru Lu (University of Warwick): natural language processing, question answering
Jiarui Qin (Tencent): Large Language Model, Recommender Systems, Information Retrieval
Li-Xian Qiao
Ruizhi Qiao (Tencent Youtu Lab): Artificial intelligence
Bo Ke
Jianfeng He (Virginia Tech): Uncertainty Estimation, Trustworthy NLU & NLG, Cross-modal Retrieval, Image Manipulation
Ke Li
Yangning Li
Yu-Hong Shen
Meng-zhen Zhang
Peixian Chen (Youtu Lab, Tencent)
Kun Yin
Bing Liu
Yun-Jiao Wu
Huang Chen
Zhongpeng Cai
Xiaotian Li