Visual Bridge: Universal Visual Perception Representations Generating

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current diffusion models suffer from limited generalization and scalability due to the "one-task, one-model" paradigm. This paper proposes a universal visual perception framework based on flow matching that unifies the mapping from image patch tokens to multi-task representations, including classification, detection, segmentation, and depth estimation. The core innovations are: (i) a multi-scale circular task embedding mechanism that constructs a unified cross-task velocity field; and (ii) anchoring on a strong self-supervised foundation model to jointly integrate the multi-scale task encodings with a generic flow-matching network, enabling end-to-end generation of diverse visual representations from image tokens. Experiments demonstrate that the framework achieves state-of-the-art performance, or performance competitive with expert models, under both zero-shot and fine-tuning settings, while significantly improving robustness, scalability, and cross-task generalization.

📝 Abstract
Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
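To make the formulation concrete, here is a minimal sketch of a linear (rectified-flow style) conditional flow-matching objective of the kind the abstract describes: patch tokens x0 are transported toward a task-specific representation x1 along a straight path, and a task-conditioned network learns to predict the constant velocity x1 - x0. All module names, dimensions, and the choice of a linear probability path are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy task-conditioned velocity field v_theta(x_t, t, task).
    Hypothetical stand-in for the paper's flow-matching network."""
    def __init__(self, dim=768, n_tasks=5):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, dim)            # per-task code
        self.time_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x_t, t, task_id):
        # x_t: (B, N, dim) interpolated tokens, t: (B,) in [0, 1], task_id: (B,)
        cond = self.task_emb(task_id) + self.time_mlp(t[:, None])
        return self.net(x_t + cond[:, None, :])               # (B, N, dim)

def flow_matching_loss(model, x0, x1, task_id):
    """x0: frozen-backbone patch tokens, x1: target task representation,
    both shaped (B, N, dim)."""
    t = torch.rand(x0.size(0), device=x0.device)              # random time
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1  # linear path
    v_target = x1 - x0                                        # constant velocity
    return ((model(x_t, t, task_id) - v_target) ** 2).mean()
```

At inference, the learned field would be integrated from t = 0 to t = 1 (e.g., with a few Euler steps) to turn the foundation model's image tokens into the representation for the requested task.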
Problem

Research questions and friction points this paper is trying to address.

Developing a universal visual perception framework for multi-task scenarios
Overcoming the single-task limitations of current computer vision models
Generating diverse visual representations across heterogeneous tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal flow-matching framework for multi-task vision
Multi-scale circular task embedding mechanism (see the sketch after this list)
Bridges heterogeneous tasks via a universal velocity field
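The page does not spell out the circular embedding, but one plausible reading, offered purely as a hypothetical sketch, is to place each task at an angle on the unit circle and encode that angle with sine/cosine features at several frequencies (the "scales"):

```python
import math
import torch

def circular_task_embedding(task_id: torch.Tensor, n_tasks: int,
                            n_scales: int = 4) -> torch.Tensor:
    """Hypothetical multi-scale circular embedding: each task gets an
    angle on the unit circle, encoded by sin/cos at n_scales frequencies.
    Returns a (batch, 2 * n_scales) tensor; a learned projection would
    map it to the model width before conditioning the velocity field."""
    theta = 2 * math.pi * task_id.float() / n_tasks            # task angle
    freqs = torch.arange(1, n_scales + 1,
                         device=task_id.device, dtype=theta.dtype)
    angles = theta[:, None] * freqs                            # (batch, n_scales)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)
```

A circular layout keeps task codes on a smooth, bounded manifold, which would let the shared velocity field interpolate between tasks; whether the paper uses exactly this parameterization is not stated.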
👥 Authors
Yilin Gao
Shanghai University
Shuguang Dou
Huawei Technologies Co., Ltd.
Junzhou Li
University of Science and Technology of China
Zhiheng Yu
Huawei Technologies Co., Ltd.
Yin Li
Huawei Technologies Co., Ltd.
Dongsheng Jiang
Huawei Technologies Co., Ltd.
Shugong Xu
Professor at Xi'an Jiaotong-Liverpool University, IEEE Fellow (Machine Learning, Pattern Recognition, Wireless Systems)