ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

📅 2025-09-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing referring expression comprehension (REC) and generation (REG) datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. To address this, the paper proposes ColLab, a collaborative spatial progressive data engine that generates REC and REG data automatically, without human supervision. Its Collaborative Multimodal Model Interaction (CMMI) strategy draws on the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions, and its Spatial Progressive Augmentation (SPA) module strengthens spatial expressiveness when an image contains duplicate instances of the same object. According to the reported experiments, ColLab significantly accelerates annotation while improving the quality and discriminability of the generated expressions. The framework was also partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning.

📝 Abstract

Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.
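
To make the duplicate-instance problem concrete, the following is a minimal sketch of one way spatial qualifiers could disambiguate objects that share a label. It is not the paper's SPA module, whose details the abstract does not give. The function name `add_spatial_qualifiers`, the left-to-right ordering rule, and the `(x1, y1, x2, y2)` box format are all assumptions made for illustration.

```python
from collections import defaultdict


def ordinal(n):
    """Return an English ordinal such as '2nd' or '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def add_spatial_qualifiers(instances):
    """Build one expression per instance (hypothetical helper).

    instances: list of dicts with 'label' (str) and 'box' (x1, y1, x2, y2).
    Instances with a unique label get a plain expression; duplicates are
    ranked by horizontal box center and given a left-to-right qualifier.
    """
    groups = defaultdict(list)
    for idx, inst in enumerate(instances):
        groups[inst["label"]].append(idx)

    expressions = [None] * len(instances)
    for label, idxs in groups.items():
        if len(idxs) == 1:
            expressions[idxs[0]] = f"the {label}"
            continue
        # Assumed ordering rule: sort duplicates by horizontal center.
        ordered = sorted(
            idxs,
            key=lambda i: (instances[i]["box"][0] + instances[i]["box"][2]) / 2,
        )
        last = len(ordered)
        for rank, i in enumerate(ordered, start=1):
            if rank == 1:
                expressions[i] = f"the leftmost {label}"
            elif rank == last:
                expressions[i] = f"the rightmost {label}"
            else:
                expressions[i] = f"the {label} {ordinal(rank)} from the left"
    return expressions
```

A single left-to-right ordering is the simplest possible rule; a progressive scheme would presumably add further cues (vertical position, relations to neighboring objects) when one axis is not enough to separate the instances.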

Problem

Research questions and friction points this paper is trying to address.

Automating REC and REG data generation without human supervision

Enhancing spatial expressiveness among duplicate object instances

Accelerating the annotation process while improving expression quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data generation without human supervision

Collaborative multimodal model interaction strategy

Spatial progressive augmentation for enhanced expressiveness
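
The collaborative interaction listed above can be pictured as a generate-and-verify loop. The sketch below is an assumption about how such a loop might be wired, not the paper's CMMI procedure: `describe`, `rewrite`, and `ground` are hypothetical callables standing in for an MLLM captioner, an LLM rewriter, and an MLLM grounding model, and the IoU acceptance threshold of 0.5 is an illustrative choice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def generate_expression(image, target_box, describe, rewrite, ground,
                        max_rounds=3, iou_threshold=0.5):
    """Return an expression that grounds back to target_box, or None.

    All three callables are hypothetical stand-ins:
      describe(image, box, feedback) -> str   # MLLM drafts a description
      rewrite(text) -> str                    # LLM polishes the draft
      ground(image, text) -> box or None      # MLLM localizes the text
    """
    feedback = None
    for _ in range(max_rounds):
        draft = describe(image, target_box, feedback)
        candidate = rewrite(draft)
        predicted = ground(image, candidate)
        # Accept only if the expression localizes the intended object.
        if predicted is not None and iou(predicted, target_box) >= iou_threshold:
            return candidate
        # Otherwise pass the failed expression back as feedback and retry.
        feedback = candidate
    return None
```

Returning `None` after `max_rounds` lets a caller drop samples that never become discriminative instead of keeping an ambiguous expression in the dataset.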
