🤖 AI Summary
Existing referring expression segmentation (RES) methods are confined to object-level localization and struggle with multi-granularity linguistic descriptions, such as part-level, single-object, and multi-object references, due to the lack of dedicated benchmarks and unified modeling frameworks. To address this, we introduce multi-granularity RES (MRES) as a new task together with the RefCOCOm benchmark, construct MRES-32M, the largest visual grounding dataset, with over 32.2 million high-quality part-level masks and captions across 1M images, and propose UniRES++, an end-to-end unified model built upon a multimodal large language model architecture. UniRES++ incorporates fine-grained visual feature disentanglement and cross-granularity joint training to enable precise mask-text alignment. Extensive experiments on RefCOCOm (MRES), gRefCOCO (generalized RES), and the classic RefCOCO, RefCOCO+, and RefCOCOg benchmarks demonstrate state-of-the-art performance across all granularities, making this the first work to achieve holistic, high-accuracy RES, from parts to scenes, within a single framework.
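To make the three reference granularities concrete, here is a minimal, hypothetical sketch of what MRES-style training records could look like. The field names (`expression`, `granularity`, `mask_rle`) and the run-length mask encoding are illustrative assumptions, not the published MRES-32M schema:

```python
# Hypothetical records illustrating the three reference granularities in MRES.
# Field names and encoding are assumptions for illustration, not the MRES-32M format.
samples = [
    {
        "image": "coco/000000123456.jpg",
        "expression": "the left ear of the brown dog",  # part-level reference
        "granularity": "part",
        "mask_rle": "...",  # run-length-encoded binary mask (elided)
    },
    {
        "image": "coco/000000123456.jpg",
        "expression": "the brown dog on the sofa",      # single-object reference
        "granularity": "object",
        "mask_rle": "...",
    },
    {
        "image": "coco/000000123456.jpg",
        "expression": "all the dogs in the room",       # multi-object reference
        "granularity": "multi-object",
        "mask_rle": "...",
    },
]
```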
📝 Abstract
Referring expression segmentation (RES) aims to segment the masks of the entities referred to by a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This poses great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking both the data resources and the unified frameworks needed for the more practical multi-grained RES. In this paper, we take a further step towards unifying RES across visual granularities. We introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. To overcome the limitation of data scarcity, we also create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With a joint model architecture and shared parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, and RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model will be publicly available at https://github.com/Rubics-Xuan/MRES.
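RES benchmarks such as RefCOCO are conventionally scored with IoU-based metrics between predicted and ground-truth masks. The following is a minimal NumPy sketch of per-expression IoU and dataset-level mIoU, a standard RES evaluation, though not necessarily the exact protocol used for the benchmarks above:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention assumed here: two empty masks count as a perfect match.
    return float(inter / union) if union > 0 else 1.0

def mean_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """mIoU: average per-expression IoU over the evaluation set."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))
```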