🤖 AI Summary
Existing multimodal large language models (MLLMs) primarily target holistic, scene-level understanding, limiting their capacity for fine-grained, object-centric, region-level visual reasoning. This paper introduces PixelRefer, a unified region-level MLLM framework that can refer to and comprehend arbitrary user-specified regions in both images and videos. Key contributions include: (1) a Scale-Adaptive Object Tokenizer (SAOT) that generates compact, semantically rich object representations from free-form regions; and (2) an Object-Centric Infusion module that pre-fuses global visual context into object tokens, yielding PixelRefer-Lite, a lightweight object-only variant. Trained with the curated PixelRefer-2.2M object-centric instruction dataset and evaluated across multiple benchmarks, PixelRefer achieves state-of-the-art performance with significantly fewer training samples, while PixelRefer-Lite substantially reduces computational overhead with near-lossless accuracy.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
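The efficiency argument behind PixelRefer-Lite can be illustrated with a toy sketch: object tokens absorb global visual context once, via a single cross-attention pass, so later layers can operate on object tokens alone. This is only a minimal illustration of the pre-fusion idea, assuming single-head, unprojected dot-product attention; the function name `object_centric_infusion`, the shapes, and the residual fusion are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_centric_infusion(object_tokens, global_tokens):
    """Sketch of pre-fusing global context into object tokens.

    One cross-attention pass: each object token attends over all global
    visual tokens and adds the attended context residually. (Hypothetical
    simplification: no learned projections, heads, or normalization.)
    """
    d = object_tokens.shape[-1]
    attn = softmax(object_tokens @ global_tokens.T / np.sqrt(d))
    return object_tokens + attn @ global_tokens

# Toy shapes: 4 object tokens, 64 global patch tokens, dim 32.
rng = np.random.default_rng(0)
obj = rng.standard_normal((4, 32))
glb = rng.standard_normal((64, 32))
fused = object_centric_infusion(obj, glb)
# After fusion, an "object-only" LLM would process 4 tokens instead of
# 4 + 64 = 68, which is where the computational savings come from.
print(fused.shape)  # (4, 32)
```

Under this reading, the cost of LLM attention drops roughly quadratically with the reduction in token count, which is consistent with the abstract's claim that global tokens matter mainly in early layers and can be folded into object tokens up front.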