ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

📅 2025-09-28
🤖 AI Summary
Existing referring expression comprehension (REC) and generation (REG) datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. To address this, we propose ColLab, a collaborative spatial progressive data engine that generates REC and REG data automatically, without human supervision. Its Collaborative Multimodal Model Interaction (CMMI) strategy uses multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions, and its Spatial Progressive Augmentation (SPA) module enhances spatial expressiveness among duplicate instances. Experiments show that ColLab significantly accelerates annotation while improving the quality and discriminability of the generated expressions. The framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning.

πŸ“ Abstract
Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.
Problem

Research questions and friction points this paper is trying to address.

Automating REC and REG data generation without human supervision
Enhancing spatial expressiveness among duplicate object instances
Accelerating the annotation process while improving expression quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data generation without human supervision
Collaborative multimodal model interaction strategy
Spatial progressive augmentation for enhanced expressiveness
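
The idea of adding spatial cues to tell duplicate instances apart can be pictured with a small sketch. This is not the paper's SPA module, whose details are not given here. It is a hypothetical illustration that assumes each detected instance has a class label and a bounding box in (x_min, y_min, x_max, y_max) form, and it disambiguates same-label instances by left-to-right order only. The names `ordinal` and `add_spatial_phrases` are invented for this example.

```python
from collections import defaultdict


def ordinal(n):
    """Return an English ordinal string such as '1st', '2nd', '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def add_spatial_phrases(instances):
    """Build one expression per instance (hypothetical helper).

    instances: list of dicts with keys 'label' (str) and
    'box' (x_min, y_min, x_max, y_max). Instances that share a label
    are ranked by horizontal box center, left to right.
    """
    # Group instance indices by class label.
    groups = defaultdict(list)
    for idx, inst in enumerate(instances):
        groups[inst["label"]].append(idx)

    expressions = [None] * len(instances)
    for label, idxs in groups.items():
        if len(idxs) == 1:
            # A unique instance needs no spatial qualifier.
            expressions[idxs[0]] = f"the {label}"
            continue
        # Assumption: horizontal order is enough to separate duplicates.
        ordered = sorted(
            idxs,
            key=lambda i: (instances[i]["box"][0] + instances[i]["box"][2]) / 2,
        )
        for rank, i in enumerate(ordered, start=1):
            expressions[i] = f"the {ordinal(rank)} {label} from the left"
    return expressions
```

A real pipeline would likely need richer cues (vertical position, relations to neighboring objects, depth), since horizontal order alone fails when duplicates are stacked vertically.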
Shilan Zhang
Wuhan University of Technology
Jirui Huang
Wuhan University of Technology
Ruilin Yao
Wuhan University of Technology
Cong Wang
Northwestern Polytechnical University
Yaxiong Chen
Wuhan University of Technology
deep hashing, deep learning
Peng Xu
Tsinghua University
Shengwu Xiong
Wuhan University of Technology
Artificial Intelligence