REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

📅 2025-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) struggle with dense prediction tasks such as semantic segmentation and keypoint detection when outputs are represented solely as text, and models that decode visual tasks from latent embeddings adapt poorly to multi-task and multi-granularity scenarios. This paper proposes REF-VLM, an end-to-end framework for unified training of visual decoding tasks. It introduces: (1) the Triplet-Based Referring Paradigm (TRP), which explicitly decouples concepts, decoding types, and targets through a triplet structure and uses symbolic delimiters to enforce structured representation learning; and (2) VTInstruct, a large-scale multi-task dataset of over 100 million multimodal dialogue samples across 25 task types, supporting visual prompts (point, box, scribble, mask) and outputs composed of text and visual units (box, keypoint, depth, mask). In qualitative and quantitative experiments, REF-VLM outperforms other MLLMs on a variety of standard benchmarks, and its delimiter-structured outputs are easier to parse and interpret.

📝 Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VTInstruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VTInstruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth, and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.
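
To make the triplet idea concrete, here is a minimal sketch of how delimiter-structured output could be parsed into (concept, decoding type, targets) records. The abstract does not specify the delimiter tokens, so the `<concept>`, `<type>`, and `<target>` tags, the `Triplet` class, and the `parse_triplets` function are all hypothetical and are not REF-VLM's actual format or API.

```python
import re
from dataclasses import dataclass

# HYPOTHETICAL delimiter format, assumed for illustration only:
#   <concept>NAME</concept><type>UNIT</type><target>i, j, ...</target>
TRIPLET_RE = re.compile(
    r"<concept>(?P<concept>[^<]+)</concept>"
    r"<type>(?P<type>[^<]+)</type>"
    r"<target>(?P<targets>[\d,\s]*)</target>"
)


@dataclass(frozen=True)
class Triplet:
    concept: str              # what is referred to, e.g. "a dog"
    decoding_type: str        # which visual unit to decode, e.g. "box" or "mask"
    targets: tuple[int, ...]  # indices of the instances to decode


def parse_triplets(text: str) -> list[Triplet]:
    """Extract every well-formed triplet from a model reply, in order."""
    triplets = []
    for match in TRIPLET_RE.finditer(text):
        # Split the comma-separated index list, skipping empty pieces.
        targets = tuple(
            int(piece) for piece in match.group("targets").split(",") if piece.strip()
        )
        triplets.append(
            Triplet(
                concept=match.group("concept").strip(),
                decoding_type=match.group("type").strip(),
                targets=targets,
            )
        )
    return triplets


if __name__ == "__main__":
    reply = (
        "The image shows "
        "<concept>a dog</concept><type>box</type><target>0</target> chasing "
        "<concept>two people</concept><type>mask</type><target>1, 2</target>."
    )
    for triplet in parse_triplets(reply):
        print(triplet)
```

Because each field sits between fixed delimiters, a downstream decoder can route every record by its decoding type without interpreting free-form text, which is the kind of parsability the abstract attributes to TRP.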

Problem

Research questions and friction points this paper is trying to address.

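Dense prediction tasks such as semantic segmentation and keypoint detection are difficult for MLLMs when represented solely as text outputs.

MLLMs that decode visual tasks from latent embeddings adapt poorly to multi-task and multi-granularity scenarios.

Diverse visual decoding tasks lack a single end-to-end framework for unified training.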

Innovation

Methods, ideas, or system contributions that make the work stand out.

Triplet-Based Referring Paradigm for visual decoding

Visual-Task Instruction Following Dataset (VTInstruct) with over 100M samples across 25 task types

Unified framework for multi-task and multi-granularity scenarios

🔎 Similar Papers

ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension