Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring expression segmentation (RES) methods are confined to object-level localization and struggle with multi-granularity linguistic descriptions (part-level, single-object, and multi-object references) because dedicated benchmarks and unified modeling frameworks are lacking. To address this, the paper introduces multi-granularity RES (MRES) as a new task together with the part-level RefCOCOm benchmark, constructs MRES-32M, the largest visual grounding dataset with over 32.2 million high-quality masks and captions across 1M images built for part-level vision-language grounding, and proposes UniRES++, an end-to-end unified model built on a multimodal large language model architecture. UniRES++ combines targeted designs for fine-grained visual feature exploration with cross-granularity joint training to achieve precise mask-text alignment. Extensive experiments on RefCOCOm (MRES), gRefCOCO (generalized RES), and the classic RefCOCO, RefCOCO+, and RefCOCOg benchmarks demonstrate state-of-the-art performance across all granularities, making this the first work to achieve holistic, high-accuracy RES, from parts to multi-object scenes, within a single framework.
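To make the three granularity levels concrete, the sketch below shows one way such a sample could be represented in Python. The field names and example expressions are illustrative assumptions, not the actual MRES-32M schema or the UniRES++ API.

```python
# Illustrative sketch only: field names are assumptions, not the real MRES-32M schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class ReferringSample:
    image_path: str    # path to the source image
    expression: str    # natural-language referring expression
    granularity: str   # "part" | "object" | "multi-object"
    mask: np.ndarray   # binary segmentation mask, shape (H, W)

# The same image can carry expressions at all three granularities.
samples = [
    ReferringSample("img_001.jpg", "the dog's left ear", "part",
                    np.zeros((480, 640), dtype=bool)),
    ReferringSample("img_001.jpg", "the brown dog", "object",
                    np.zeros((480, 640), dtype=bool)),
    ReferringSample("img_001.jpg", "all dogs on the sofa", "multi-object",
                    np.zeros((480, 640), dtype=bool)),
]
```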

📝 Abstract
Referring expression segmentation (RES) aims to segment the masks of the entities that match a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards a visual-granularity-unified RES task. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, and RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model will be publicly available at https://github.com/Rubics-Xuan/MRES.
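The RES benchmarks above are typically scored by intersection-over-union between predicted and ground-truth masks, either averaged per expression (mIoU) or accumulated over the whole dataset (cIoU). Below is a minimal sketch of the per-expression metric, assuming binary NumPy masks; it shows the standard definition, not code from the released repository.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention assumed here: two empty masks count as a perfect match.
    return float(inter / union) if union > 0 else 1.0

def mean_iou(pairs) -> float:
    """Average per-expression IoU (mIoU) over (prediction, ground-truth) pairs."""
    return sum(mask_iou(p, g) for p, g in pairs) / len(pairs)
```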
Problem

Research questions and friction points this paper is trying to address.

Unified segmentation for multi-granularity visual targets
Overcoming data scarcity in part-level referring expression segmentation
Integrating object-level and part-level RES tasks in one model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces multi-granularity referring expression segmentation task
Proposes unified multimodal large language model UniRES++
Creates largest part-level dataset MRES-32M
Jing Liu
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China
Wenxuan Wang
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Beijing Academy of Artificial Intelligence, Beijing 100085, China
Yisi Zhang
University of Science and Technology Beijing, Beijing 100083, China
Yepeng Tang
Beijing Jiaotong University
VideoLLM, Video Understanding
Xingjian He
Institute of Automation, Chinese Academy of Sciences (CASIA)
computer vision, semantic segmentation
Longteng Guo
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Tongtian Yue
Institute of Automation, Chinese Academy of Sciences
Multimodal Pretrain, Visual-Language
Xinlong Wang
Beijing Academy of Artificial Intelligence, Beijing 100085, China