🤖 AI Summary
This work addresses the text-dominant bias in existing vision-language models (VLMs), which treat visual signals merely as passive inputs; as a result, fine-grained visual details are lost and multimodal understanding remains coarse. To overcome this limitation, we propose Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. VLUAS unifies visual and linguistic tokens into a single autoregressive prediction sequence, so that visual tokens serve as prediction targets rather than only as contextual inputs. This approach departs from conventional text-centric training and supports a wide range of vision-centric tasks without task-specific customization. Extensive experiments show that Youtu-VL achieves competitive performance on both general multimodal benchmarks and vision-intensive tasks while improving visual detail preservation and joint multimodal modeling.
📝 Abstract
Despite the significant advances represented by Vision-Language Models (VLMs), current architectures often fail to retain fine-grained visual information, which leads to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm in prevailing VLMs: a text-dominant optimization bias that treats visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
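The shift from "vision-as-input" to "vision-as-target" can be illustrated with a toy loss computation. The following is a minimal sketch, not the Youtu-VL implementation: it assumes visual content is represented as discrete token ids sharing one vocabulary with text, which the abstract does not specify, and the names `supervised_positions`, `autoregressive_loss`, and `supervise_vision` are hypothetical. The only difference between the two regimes in this sketch is which positions contribute to the next-token cross-entropy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def supervised_positions(is_visual, supervise_vision):
    """Positions whose token is used as a prediction target.

    Position 0 has no preceding context, so it is never a target.
    With supervise_vision=False (the text-dominant regime), visual
    positions are skipped; with True, every position is supervised.
    """
    return [
        p for p in range(1, len(is_visual))
        if supervise_vision or not is_visual[p]
    ]


def autoregressive_loss(logits, tokens, is_visual, supervise_vision):
    """Mean next-token cross-entropy over the supervised positions.

    logits[p - 1] holds the model's scores for the token at position p.
    Assumption (hypothetical): visual and text tokens are ids in one
    shared vocabulary, so a single softmax covers both modalities.
    """
    positions = supervised_positions(is_visual, supervise_vision)
    if not positions:
        return 0.0
    total = sum(-log_softmax(logits[p - 1])[tokens[p]] for p in positions)
    return total / len(positions)


# Toy sequence: two visual tokens followed by two text tokens.
tokens = [0, 1, 2, 3]
is_visual = [True, True, False, False]
# The first prediction step confidently picks the wrong visual token.
logits = [[10.0, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]

text_only = autoregressive_loss(logits, tokens, is_visual, supervise_vision=False)
unified = autoregressive_loss(logits, tokens, is_visual, supervise_vision=True)
print(f"text-only loss: {text_only:.3f}, unified loss: {unified:.3f}")
```

In this toy setting the text-only objective never sees the mistaken visual prediction, whereas the unified objective penalizes it. This mirrors the abstract's argument that visual detail goes unsupervised when visual tokens act only as conditioning context.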