🤖 AI Summary
To address task interference, poor generalization, and architectural coupling in multi-task vision-language modeling, this paper proposes VisionLLM v2—a unified, end-to-end multimodal large language model. Methodologically, it introduces (1) a "super link" cross-module information routing mechanism enabling dynamic gradient and feature interaction between the large language model (LLM) and plug-and-play task decoders, thereby mitigating multi-task training conflicts; (2) a unified architecture comprising a shared vision encoder, an LLM backbone, and prompt-driven decoders, supporting prompt-based task switching; and (3) large-scale, cross-task dataset construction with joint end-to-end training. Evaluated across hundreds of vision-language tasks—including visual question answering, object localization, pose estimation, image generation, and editing—VisionLLM v2 matches or approaches the performance of task-specialized models while achieving significantly improved generalization and task-adaptation efficiency.
📝 Abstract
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link" as a medium to connect the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support this diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize across them with a single set of shared parameters, switching tasks via different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
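To make the "super link" idea concrete, the sketch below shows the routing pattern in miniature: the MLLM produces a task-conditioned embedding, and a prompt-selected, plug-and-play decoder consumes it. All names here (`super_link`, `decoders`, the toy decoder outputs) are illustrative assumptions, not the paper's actual API; in the real model the link is differentiable so decoder gradients flow back into the MLLM.

```python
# Hedged sketch of "super link"-style routing (illustrative only, not the
# paper's implementation). The MLLM would emit a routing embedding; here we
# stand in a plain list of floats and string outputs for the decoders.
from typing import Callable, Dict, List

# Plug-and-play task decoders, keyed by the task named in the user prompt.
decoders: Dict[str, Callable[[List[float]], str]] = {
    "detection": lambda emb: f"boxes from embedding of dim {len(emb)}",
    "pose": lambda emb: f"keypoints from embedding of dim {len(emb)}",
    "generation": lambda emb: f"image latents from embedding of dim {len(emb)}",
}

def super_link(task: str, routing_embedding: List[float]) -> str:
    """Route the MLLM's task embedding to the decoder the prompt selects.

    In the actual model this connection also carries gradient feedback from
    the decoder back to the MLLM; this sketch shows only the forward routing.
    """
    if task not in decoders:
        raise KeyError(f"no decoder registered for task '{task}'")
    return decoders[task](routing_embedding)

# Task switching with shared upstream parameters: same embedding interface,
# different downstream decoder, chosen by the prompt.
emb = [0.1, 0.2, 0.3, 0.4]
print(super_link("detection", emb))
print(super_link("pose", emb))
```

The design point this illustrates is decoupling: the MLLM's interface to every decoder is the same embedding format, so new decoders can be registered without touching the shared backbone.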