Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

📅 2025-01-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address insufficient region-level understanding in image and video comprehension, this paper introduces Omni-RGPT, a unified multimodal large language model. Methodologically, it (1) proposes Token Mark, a set of tokens embedded directly into the spatial regions indicated by region prompts (e.g., bounding boxes or masks) and simultaneously inserted into the text prompt, linking the visual and text tokens that refer to the same region; (2) adds an auxiliary task that exploits the consistency of these tokens, supporting stable region interpretation across a video without requiring tracklets; and (3) introduces RegVID-300k, a large-scale region-level video instruction dataset. Omni-RGPT achieves state-of-the-art results on image- and video-based commonsense reasoning benchmarks and shows strong performance on captioning and referring expression comprehension tasks.

📝 Abstract

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
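The Token Mark idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`make_token_marks`, `embed_marks`, `build_prompt`, `box_mask`), the additive embedding, the `<region>` placeholder, and the `<mark_i>` naming are all assumptions chosen for illustration. It only shows the general pattern of writing one shared vector into the masked region of each frame's feature map and referring to the same mark in the text prompt.

```python
import random


def make_token_marks(num_marks, dim, seed=0):
    """Hypothetical bank of token marks: one vector of size `dim` per mark."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_marks)]


def box_mask(h, w, x0, y0, x1, y1):
    """Turn a box prompt into an h x w binary mask (x1, y1 exclusive)."""
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(w)]
            for y in range(h)]


def embed_marks(features, masks, mark_ids, marks):
    """Add each region's mark vector to every feature cell its mask covers.

    features: H x W x D nested lists (one frame's visual feature map)
    masks:    list of H x W binary grids, one per region
    mark_ids: index into `marks` for each region
    Additive embedding is an assumption of this sketch.
    """
    out = [[list(cell) for cell in row] for row in features]
    for mask, mid in zip(masks, mark_ids):
        for y, row in enumerate(mask):
            for x, inside in enumerate(row):
                if inside:
                    out[y][x] = [f + m for f, m in zip(out[y][x], marks[mid])]
    return out


def build_prompt(template, mark_ids):
    """Replace each <region> placeholder, in order, with its token mark name."""
    prompt = template
    for mid in mark_ids:
        prompt = prompt.replace("<region>", f"<mark_{mid}>", 1)
    return prompt


# Two 4x4 frames with 3-dim zero features; the region moves between frames
# but keeps the same mark, so no tracklet identity is needed in this sketch.
marks = make_token_marks(num_marks=4, dim=3)
frames = [[[[0.0] * 3 for _ in range(4)] for _ in range(4)] for _ in range(2)]
masks_per_frame = [[box_mask(4, 4, 0, 0, 2, 2)], [box_mask(4, 4, 1, 1, 3, 3)]]
mark_ids = [2]

marked = [embed_marks(f, m, mark_ids, marks)
          for f, m in zip(frames, masks_per_frame)]
prompt = build_prompt("What is <region> doing?", mark_ids)
```

Because the same mark vector appears in every frame where the region is visible and in the prompt, the text side and the visual side share one reference for the region. How the actual model projects, samples, or trains these marks is not specified in this summary.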

Problem

Research questions and friction points this paper is trying to address.

Image Understanding

Video Analysis

Region Information Parsing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni-RGPT

Token Mark

Video Understanding

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs