VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

📅 2024-06-12
🏛️ arXiv.org
📈 Citations: 20
Influential: 2
🤖 AI Summary
To address task interference, poor generalization, and architectural coupling in multi-task vision-language modeling, this paper proposes VisionLLM v2, a unified, end-to-end multimodal large language model. Methodologically, it introduces (1) a "super link" cross-module information-routing mechanism that enables dynamic gradient and feature exchange between the large language model (LLM) and plug-and-play task decoders, mitigating multi-task training conflicts; (2) a unified architecture comprising a shared vision encoder, an LLM backbone, and prompt-driven decoders, supporting task switching via user prompts; and (3) large-scale, cross-task dataset construction with joint end-to-end training. Evaluated across hundreds of vision-language tasks, including visual question answering, object localization, pose estimation, and image generation and editing, VisionLLM v2 matches or approaches the performance of task-specialized models while achieving significantly improved generalization and task-adaptation efficiency.
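The "super link" routing described above can be sketched in a few lines of plain Python. This is an illustrative assumption about the mechanism, not the authors' code: names such as `route_super_link` and the `[DET]`/`[GEN]` routing tokens are hypothetical. The idea sketched is that the LLM emits special tokens, and the hidden state at each such token is handed to the matching task decoder, so every decoder trains against one shared backbone.

```python
# Hypothetical sketch of a "super link"-style dispatcher (illustrative
# names, not the paper's implementation). The LLM's output sequence is
# modeled as (token, hidden-state) pairs; hidden states at routing
# tokens are forwarded to plug-and-play task decoders.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    token: str           # token emitted by the LLM at this position
    hidden: List[float]  # toy stand-in for the hidden state vector


def route_super_link(
    steps: List[Step],
    decoders: Dict[str, Callable[[List[float]], dict]],
) -> Dict[str, dict]:
    """Send each routing token's hidden state to its task decoder."""
    outputs: Dict[str, dict] = {}
    for step in steps:
        if step.token in decoders:  # e.g. "[DET]", "[POSE]", "[GEN]"
            outputs[step.token] = decoders[step.token](step.hidden)
    return outputs


# Toy decoders standing in for detection and generation heads.
decoders: Dict[str, Callable[[List[float]], dict]] = {
    "[DET]": lambda h: {"box": [v * 2 for v in h]},
    "[GEN]": lambda h: {"latent": sum(h)},
}

steps = [
    Step("The", [0.1]),          # ordinary text token: ignored by routing
    Step("[DET]", [1.0, 2.0]),   # routed to the detection decoder
    Step("[GEN]", [0.5, 0.5]),   # routed to the generation decoder
]

outputs = route_super_link(steps, decoders)
```

In the actual model the hidden states would be learned embeddings and the decoders trainable modules, so gradients from every decoder flow back through the same LLM, which is what lets one set of shared parameters serve many tasks.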

📝 Abstract
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Learning
Visual Question Answering
Object Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

VisionLLM v2
super link mechanism
visual-linguistic integration