GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

๐Ÿ“… 2025-03-13
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing referring expression segmentation datasets suffer from three major bottlenecks: narrow category coverage, limited linguistic diversity in expressions, and low annotation quality. To address these, this work introduces the first benchmark suite for multi-granularity pixel-level vision-language grounding. We propose a novel multi-VLM agent collaboration framework for automated annotation, achieving 4.5ร— higher annotation efficiency than GLaMM while significantly improving accuracy. We release a large-scale training set comprising 9.56 million expression-segmentation pairs and a high-quality evaluation benchmark of 3,800 imagesโ€”covering fine-grained localization and cross-scene generalization. Our model achieves state-of-the-art performance with 68.9 cIoU on gRefCOCO and 55.3 gIoU on RefCOCOm, substantially outperforming prior methods. This work establishes a scalable, high-fidelity, and multi-granularity paradigm for pixel-level vision-language understanding.

Technology Category

Application Category

๐Ÿ“ Abstract
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results. Specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., $4.5 imes$ faster than the GLaMM.
Problem

Research questions and friction points this paper is trying to address.

Limited object categories in existing datasets
Insufficient textual diversity in current datasets
Scarcity of high-quality annotations for pixel grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data annotation using Vision-Language Models
Large-scale dataset with diverse referring expressions
Curated benchmark for precise evaluation
๐Ÿ”Ž Similar Papers
No similar papers found.
R
Rui Hu
School of EIC, Huazhong University of Science & Technology
L
Lianghui Zhu
School of EIC, Huazhong University of Science & Technology
Y
Yuxuan Zhang
School of EIC, Huazhong University of Science & Technology
Tianheng Cheng
Tianheng Cheng
ByteDance Seed
Computer VisionObject DetectionInstance SegmentationMultimodal ModelsAutonomous Driving
L
Lei Liu
vivo AI Lab
Heng Liu
Heng Liu
Guangxi Minzu University
adaptive fuzzy controlfractional-order systemnonlinear systemrobust controlneural network
L
Longjin Ran
vivo AI Lab
Xiaoxin Chen
Xiaoxin Chen
Coriell Institute for Medical Research
W
Wenyu Liu
School of EIC, Huazhong University of Science & Technology
Xinggang Wang
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial IntelligenceComputer VisionAutonomous DrivingObject DetectionObject Segmentation