🤖 AI Summary
Existing referring expression segmentation (RES) methods are confined to object-level localization and struggle with multi-granularity linguistic descriptions, such as part-level, single-object, and multi-object references, due to the lack of dedicated benchmarks and unified modeling frameworks. To address this, we introduce multi-granularity RES (MRES) as a new task together with the RefCOCOm benchmark, construct MRES-32M, the largest visual grounding dataset, with over 32.2 million high-quality part-level masks and captions across 1M images, and propose UniRES++, an end-to-end unified model built upon a multimodal large language model architecture. UniRES++ incorporates fine-grained visual feature disentanglement and cross-granularity joint training to enable precise mask-text alignment. Extensive experiments on RefCOCOm (MRES), gRefCOCO (generalized RES), and the classic RefCOCO, RefCOCO+, and RefCOCOg benchmarks demonstrate state-of-the-art performance across all granularities, making this the first work to achieve holistic, high-accuracy RES, from parts to scenes, within a single framework.
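To make the three reference granularities concrete, here is a minimal, hypothetical sketch of what MRES-style training records could look like. The field names (`expression`, `granularity`, `mask_rle`) and the run-length mask encoding are illustrative assumptions, not the published MRES-32M schema:

```python
# Hypothetical records illustrating the three reference granularities in MRES.
# Field names and encoding are assumptions for illustration, not the MRES-32M format.
samples = [
    {
        "image": "coco/000000123456.jpg",
        "expression": "the left ear of the brown dog",  # part-level reference
        "granularity": "part",
        "mask_rle": "...",  # run-length-encoded binary mask (elided)
    },
    {
        "image": "coco/000000123456.jpg",
        "expression": "the brown dog on the sofa",      # single-object reference
        "granularity": "object",
        "mask_rle": "...",
    },
    {
        "image": "coco/000000123456.jpg",
        "expression": "all the dogs in the room",       # multi-object reference
        "granularity": "multi-object",
        "mask_rle": "...",
    },
]
```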
📝 Abstract
Referring expression segmentation (RES) aims to segment the masks of the entities referred to by a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This poses great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking both the data resources and the unified frameworks needed for the more practical multi-grained RES. In this paper, we take a further step towards unifying RES across visual granularities. We introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. To overcome the limitation of data scarcity, we also create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With a joint model architecture and shared parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, and RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model will be publicly available at https://github.com/Rubics-Xuan/MRES.
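RES benchmarks such as RefCOCO are conventionally scored with IoU-based metrics between predicted and ground-truth masks. The following is a minimal NumPy sketch of per-expression IoU and dataset-level mIoU, a standard RES evaluation, though not necessarily the exact protocol used for the benchmarks above:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention assumed here: two empty masks count as a perfect match.
    return float(inter / union) if union > 0 else 1.0

def mean_iou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    """mIoU: average per-expression IoU over the evaluation set."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))
```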