Vision Generalist Model: A Survey

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision foundation models face significant challenges stemming from strong input/output heterogeneity and the absence of a unified modeling framework. To address these, the paper systematically surveys over one hundred works and proposes a comprehensive taxonomy spanning task categories, data foundations, architectural designs, and evaluation benchmarks, highlighting fundamental distinctions from NLP foundation models. It introduces a “cross-domain co-evolution” research paradigm that integrates key techniques, including multi-task joint training, unified sequence-based representation, vision instruction tuning, and cross-dataset pretraining, which apply to ViT, MoE, and diffusion-augmented architectures. It also identifies critical performance bottlenecks and evaluation biases, and advocates a more equitable and standardized assessment protocol for visual generalization. Finally, it provides design guidelines for building production-ready multi-task vision systems.

📝 Abstract
Recently, we have witnessed the great success of generalist models in natural language processing. A generalist model is a general framework trained on massive data that can process various downstream tasks simultaneously. Encouraged by the impressive performance of these models, an increasing number of researchers are applying them to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them in a unified representation. In this paper, we provide a comprehensive overview of vision generalist models, covering their characteristics and capabilities. First, we review the background, including the datasets, tasks, and benchmarks. Then, we examine the framework designs proposed in existing research and introduce the techniques used to enhance their performance. To help researchers better understand the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we present some real-world application scenarios, examine the persistent challenges, and offer insights into possible directions for future research.
Problem

Research questions and friction points this paper is trying to address.

Exploring vision generalist models for diverse computer vision tasks
Addressing challenges in unifying vision task representations
Surveying frameworks and techniques for vision generalist model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

General framework for diverse vision tasks
Massive data training for unified representation
Performance enhancement techniques review
Authors

Ziyi Wang: Department of Automation, Tsinghua University, China
Yongming Rao: Tencent Hunyuan (computer vision, deep learning)
Shuofeng Sun: Beijing University of Posts and Telecommunications, China
Xinrun Liu: University of Science and Technology Beijing, China
Yi Wei: Department of Automation, Tsinghua University, China
Xumin Yu: Tencent Hunyuan (computer vision)
Zuyan Liu: Tsinghua University (multi-modal, computer vision)
Yanbo Wang: Department of Automation, Tsinghua University, China
Hongmin Liu: University of Science and Technology Beijing, China
Jie Zhou: Department of Automation, Tsinghua University, China
Jiwen Lu: Department of Automation, Tsinghua University, China