Advancing Visual Large Language Model for Multi-granular Versatile Perception

📅 2025-07-22
🤖 AI Summary
Existing visual large language models (VLLMs) are largely constrained to narrow combinations of prediction types (e.g., classification, bounding box) and instruction formats, which limits their generalization and task coverage. To address this, the paper proposes MVP-LM, a unified framework supporting both word-level and sentence-level perception as well as bounding-box and mask prediction, enabling multi-granularity, multi-task visual understanding. Its key contributions are: (1) a chain-of-thought-inspired multi-granularity decoder that explicitly models hierarchical reasoning from fine- to coarse-grained outputs; and (2) a unified supervised fine-tuning paradigm built on a CoT-inspired dataset unification strategy, combined with a query enhancement scheme that exploits the decoding and generative capabilities of VLLMs. Evaluated on benchmarks including panoptic segmentation, object detection, and referring expression segmentation, MVP-LM achieves substantial performance gains, demonstrating strong generalization and scalability across diverse vision-language tasks.

📝 Abstract
Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups along two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited versatility in visual perception tasks
Integrates word-based and sentence-based perception tasks
Enhances decoding and generative capabilities in VLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-granular decoder for versatile perception tasks
CoT-inspired dataset unification for supervised fine-tuning
Query enhancement strategy leveraging VLLM capabilities
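The four-quadrant task taxonomy the paper describes (word-based vs. sentence-based instructions, crossed with box vs. mask predictions) can be illustrated with a toy unified output schema. This is a minimal sketch for intuition only; all names and types here are hypothetical and are not taken from the paper's code:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PerceptionInstruction:
    # kind = "word" for category-name tasks (detection, panoptic segmentation),
    # kind = "sentence" for referring-expression tasks.
    kind: str
    text: str

@dataclass
class PerceptionOutput:
    # Both instruction granularities share one output schema:
    # a label, a box, and an optional mask (None for box-only tasks).
    label: str
    box: List[float]                         # [x1, y1, x2, y2]
    mask: Optional[List[List[int]]] = None   # binary mask grid, if predicted

def unify(instr: PerceptionInstruction,
          raw_detections: List[Tuple[str, List[float], Optional[List[List[int]]]]]
          ) -> List[PerceptionOutput]:
    """Map raw (label, box, mask) tuples into the shared schema, regardless
    of whether the instruction was word-based or sentence-based."""
    return [PerceptionOutput(label=l, box=b, mask=m) for (l, b, m) in raw_detections]
```

The point of such a schema is that a single decoder head can serve all four task groups: only the instruction text and the presence of a mask vary, not the interface.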
👥 Authors
Wentao Xiang (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Haoxian Tan (Meituan Inc.)
Yujie Zhong (Meituan Inc.)
Cong Wei (University of Waterloo)
Dengjie Li (Meituan Inc.)
Yujiu Yang (SIGS, Tsinghua University)