UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) excel at holistic image/video understanding but lack pixel-level vision-language alignment, limiting their capability for referring expression segmentation, fine-grained localization, and multi-step reasoning. To address this, we propose the first unified large multimodal model that jointly supports pixel-level perception and general visual understanding. Our method introduces visual prompt encoding, mask generation, and a conditional reasoning module, with an intermediate mask pointer serving as a bridge between linguistic semantics and pixel-level signals—enabling a closed-loop, fine-grained vision-language understanding pipeline: “input visual prompt → generate mask → perform subsequent reasoning conditioned on the mask.” The model natively supports joint referring, segmentation, and question answering. It achieves state-of-the-art performance on 10 pixel-level image/video benchmarks and, for the first time, introduces the PixelQA task to rigorously evaluate its multi-step reasoning capability.
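The closed-loop pipeline described above ("input visual prompt → generate mask → reason conditioned on the mask") can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the prompt encoder, mask decoder, and reasoner are stand-in stubs (a click prompt, a fixed neighborhood mask, and an area-counting reasoner), chosen only to make the three-stage data flow concrete.

```python
# Toy sketch of a "prompt -> mask -> mask-conditioned reasoning" loop.
# All component behaviors are illustrative stubs, not UniPixel's actual modules.
from dataclasses import dataclass


@dataclass
class Mask:
    pixels: set  # set of (row, col) coordinates covered by the object


def encode_visual_prompt(point, image_size):
    # Stub prompt encoder: the visual prompt here is just a click location.
    return point


def generate_mask(prompt, image_size):
    # Stub mask decoder: cover a 3x3 neighborhood around the click,
    # clipped to the image bounds.
    r0, c0 = prompt
    h, w = image_size
    return Mask({(r, c)
                 for r in range(r0 - 1, r0 + 2)
                 for c in range(c0 - 1, c0 + 2)
                 if 0 <= r < h and 0 <= c < w})


def reason_over_mask(mask, question):
    # Stub reasoner conditioned on the intermediate mask "pointer":
    # it can only answer area questions about the masked object.
    if "area" in question:
        return len(mask.pixels)
    return None


def pixel_qa(point, image_size, question):
    # The closed loop: prompt encoding -> mask generation -> reasoning.
    prompt = encode_visual_prompt(point, image_size)
    mask = generate_mask(prompt, image_size)
    return mask, reason_over_mask(mask, question)


mask, answer = pixel_qa((2, 2), (8, 8), "what is the area of the object?")
```

The point of the sketch is the control flow: the mask is produced as an intermediate artifact and then consumed by the downstream reasoning step, rather than being the final output.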

📝 Abstract
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts, generates relevant masks on demand, and performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between pixel-level perception and visual reasoning capabilities
Integrating referring and segmentation tasks into unified visual understanding
Enabling fine-grained pixel-level alignment between visual signals and language
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniPixel integrates pixel-level perception with visual understanding
Model processes visual prompts and generates masks on demand
Performs reasoning conditioned on intermediate mask pointers
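One way to picture "reasoning conditioned on intermediate mask pointers" is to pool the visual features inside a generated mask into a single embedding, then splice that embedding into the language model's token sequence at a placeholder position. The functions below are a hypothetical sketch of that idea; `mask_pooled_embedding` and `splice_pointer` are invented names, and the paper's actual pointer mechanism may differ.

```python
# Hypothetical sketch of a mask-pointer conditioning step (illustrative names,
# not UniPixel's actual API).
import numpy as np


def mask_pooled_embedding(feature_map, mask):
    # Average-pool the visual features under a boolean mask, producing a
    # single "mask pointer" vector that summarizes the selected region.
    # feature_map: (H, W, D) array; mask: (H, W) boolean array.
    selected = feature_map[mask]          # (num_masked_pixels, D)
    return selected.mean(axis=0)          # (D,)


def splice_pointer(token_embeddings, pointer_embedding, slot):
    # Replace the placeholder token's embedding at position `slot` with the
    # mask pointer, so later transformer layers attend to the region summary.
    out = token_embeddings.copy()
    out[slot] = pointer_embedding
    return out
```

The design intuition is that the mask acts as a bridge: a pixel-level result from the segmentation stage is compressed into a token-level signal the language model can reason over.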