PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) primarily focus on scene-level understanding, limiting their capability for fine-grained, object-centric visual reasoning. This paper introduces PixelRefer, a unified region-level MLLM framework that can refer to and comprehend arbitrary user-specified regions in both images and videos. The method integrates free-form region encoding, hierarchical attention, and an object-centric infusion mechanism. Key contributions include: (1) a Scale-Adaptive Object Tokenizer that generates semantically rich, multi-scale object representations; and (2) an Object-Centric Infusion module that pre-fuses global contextual information into object tokens, enabling lightweight yet effective object-centered modeling. Evaluated across multiple benchmarks, the framework achieves state-of-the-art performance with significantly fewer training samples, and the lightweight variant substantially reduces computational overhead while preserving near-lossless accuracy.


📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
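The abstract describes PixelRefer-Lite's Object-Centric Infusion as pre-fusing global visual context into object tokens so that only object tokens need to enter the LLM. A minimal sketch of that idea, assuming a simple single-head cross-attention with a residual connection (the function name, shapes, and attention formulation are illustrative assumptions, not the paper's actual module):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_centric_infusion(object_tokens, global_tokens):
    """Pre-fuse global visual context into object tokens via
    single-head cross-attention (hypothetical sketch)."""
    d = object_tokens.shape[-1]
    # Each object token attends over all global patch tokens.
    scores = object_tokens @ global_tokens.T / np.sqrt(d)  # (n_obj, n_glob)
    attn = softmax(scores, axis=-1)
    # Residual infusion: object identity is kept, context is added.
    return object_tokens + attn @ global_tokens

rng = np.random.default_rng(0)
obj = rng.standard_normal((4, 64))     # 4 compact object tokens
glob = rng.standard_normal((196, 64))  # 14x14 global patch tokens
fused = object_centric_infusion(obj, glob)
print(fused.shape)  # (4, 64)
```

After such a pre-fusion step, the global tokens could be dropped entirely, which is consistent with the abstract's claim that the Object-Only Framework reduces computational cost while keeping semantic fidelity.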
Problem

Research questions and friction points this paper is trying to address.

Enabling fine-grained object referring across both images and videos
Generating compact object representations from arbitrary user-specified regions
Reducing computational cost while preserving semantic fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale-Adaptive Object Tokenizer generates compact object representations
Object-Centric Infusion pre-fuses global context into object tokens
Object-Only Framework reduces computational cost while maintaining fidelity
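The Scale-Adaptive Object Tokenizer is described as producing compact, multi-scale object representations from free-form regions. One plausible reading is masked pooling of visual features over a region at several grid scales; the sketch below is an assumption for illustration (function name, scale set, and pooling scheme are hypothetical, not the paper's design):

```python
import numpy as np

def scale_adaptive_tokens(feat, mask, scales=(1, 2)):
    """Pool masked region features at multiple grid scales into a
    compact set of object tokens (hypothetical SAOT-style sketch).

    feat: (H, W, C) visual feature map; mask: (H, W) binary region mask.
    """
    H, W, _ = feat.shape
    tokens = []
    for s in scales:
        hs, ws = H // s, W // s
        for i in range(s):
            for j in range(s):
                f = feat[i*hs:(i+1)*hs, j*ws:(j+1)*ws]
                m = mask[i*hs:(i+1)*hs, j*ws:(j+1)*ws]
                if m.sum() > 0:  # skip cells the region does not cover
                    tokens.append((f * m[..., None]).sum((0, 1)) / m.sum())
    return np.stack(tokens)  # (n_tokens, C)

feat = np.ones((8, 8, 16))
mask = np.ones((8, 8))
toks = scale_adaptive_tokens(feat, mask)
print(toks.shape)  # (5, 16): one global token plus four 2x2-grid tokens
```

With scales (1, 2) and a full-image mask this yields 1 + 4 = 5 tokens; an irregular free-form mask would simply contribute fewer cells at the finer scale.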
Yuqian Yuan
PhD student, Zhejiang University
Computer Vision · Machine Learning
Wenqiao Zhang
Zhejiang University
Xin Li
DAMO Academy, Alibaba Group
Shihao Wang
The Hong Kong Polytechnic University
Kehan Li
Stanford University
Wentong Li
Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Vision-Language Model · Robotics
Jun Xiao
Zhejiang University
Lei Zhang
The Hong Kong Polytechnic University
Beng Chin Ooi
Zhejiang University