D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

πŸ“… 2025-05-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Visual grounding suffers from high annotation costs, limiting dataset scale and large-model performance; existing pseudo-labeling approaches rely on human-provided captions, resulting in poor generalizability and limited diversity. This paper proposes the first dual-driven (closed-set + open-set) pseudo-label generation framework for unlabeled images, automatically producing high-quality region–text pairs solely from input images. Our method synergistically integrates multimodal large language models with object detectors, introducing a dual-annotation strategy and a distribution-aware noise filtering mechanism to simultaneously scale data volume, broaden semantic coverage, and enhance domain adaptability. Evaluated on three mainstream visual grounding benchmarks, our approach significantly outperforms prior methods, achieving state-of-the-art performance and effectively alleviating the annotation bottleneck.
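To make the pipeline concrete, below is a minimal sketch of the dual-driven annotation idea described above, not the paper's released code. The wrappers `run_closed_set_detector` and `query_mllm_for_regions` are hypothetical placeholders standing in for an object detector with a fixed label vocabulary and a prompted multimodal LLM, respectively.

```python
# Minimal sketch of the dual-driven annotation idea (illustrative only).
# Both model wrappers below are hypothetical placeholders for an
# off-the-shelf detector and a prompted multimodal LLM.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RegionTextPair:
    box: Box
    text: str
    source: str  # "closed_set" or "open_set"

def run_closed_set_detector(image, vocabulary: List[str]) -> List[RegionTextPair]:
    """Hypothetical wrapper: detect objects restricted to a fixed class
    vocabulary, then turn each detection into a templated expression."""
    detections: List[Tuple[str, Box]] = []  # e.g. [("dog", (10, 20, 120, 200))]
    return [RegionTextPair(box, f"the {label}", "closed_set")
            for label, box in detections]

def query_mllm_for_regions(image) -> List[RegionTextPair]:
    """Hypothetical wrapper: prompt a multimodal LLM for free-form referring
    expressions of salient regions, each grounded to a box."""
    proposals: List[Tuple[str, Box]] = []  # e.g. [("man in red jacket", box)]
    return [RegionTextPair(box, text, "open_set") for text, box in proposals]

def dual_driven_annotate(image, vocabulary: List[str]) -> List[RegionTextPair]:
    """Pool closed-set (detector-driven) and open-set (MLLM-driven)
    region-text pairs into one set of pseudo labels for filtering."""
    return run_closed_set_detector(image, vocabulary) + query_mllm_for_regions(image)
```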

πŸ“ Abstract
Visual grounding aims to localize a target region in an image based on a free-form natural language description. With the rise of Transformer architectures, there is an increasing need for larger datasets to boost performance, yet the high cost of manual annotation hinders dataset scale and, in turn, the effectiveness of large models. Previous pseudo-label generation methods rely heavily on human-labeled captions from the original dataset, limiting scalability and diversity. To address this, we propose D2AF, a robust annotation framework for visual grounding that uses only input images. This approach overcomes dataset size limitations and enriches both the quantity and diversity of referring expressions. Our method leverages multimodal large models and object detection models: through dual-driven annotation strategies, it generates detailed region-text pairs using both closed-set and open-set approaches. We further conduct an in-depth analysis of data quantity and data distribution, finding that increasing data volume improves model performance, but the degree of improvement depends on how well the pseudo labels broaden the original data distribution. Based on these insights, we propose a consistency- and distribution-aware filtering method that improves data quality by removing erroneous and redundant pairs. Experiments on three visual grounding tasks demonstrate that our method significantly improves the performance of existing models and achieves state-of-the-art results.
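The filtering step can be illustrated with a short sketch in the spirit of the abstract. The specific tests here are assumptions, not the paper's exact criteria: consistency is checked by re-grounding each expression and comparing IoU with its source box, and redundancy by cosine similarity between text embeddings; `reground` and `embed` are hypothetical callables supplied by the caller.

```python
# Illustrative consistency- and distribution-aware filter. The thresholds,
# the IoU re-grounding test, and the embedding-similarity test are all
# assumptions made for this sketch, not the paper's exact criteria.
from typing import Callable, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_pseudo_labels(pairs, reground: Callable, embed: Callable,
                         iou_thresh: float = 0.5, sim_thresh: float = 0.9):
    """Keep a pseudo region-text pair only if (1) re-grounding its text lands
    near the same box (consistency check against erroneous labels) and
    (2) its text embedding is not a near-duplicate of already-kept pairs
    (distribution-aware check against redundancy). `embed` is assumed to
    return an L2-normalized vector, so `@` computes cosine similarity."""
    kept, kept_vecs = [], []
    for pair in pairs:  # pairs carry .text and .box, as in the sketch above
        if iou(reground(pair.text), pair.box) < iou_thresh:
            continue  # inconsistent, likely erroneous
        vec = embed(pair.text)
        if any(float(vec @ v) > sim_thresh for v in kept_vecs):
            continue  # redundant, adds no new semantic coverage
        kept_vecs.append(vec)
        kept.append(pair)
    return kept
```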
Problem

Research questions and friction points this paper is trying to address.

High cost of manual annotation limits the scale of visual grounding datasets
Existing pseudo-label methods rely on human-written captions, reducing diversity
Noisy and redundant pseudo labels hinder model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multimodal large models for annotation
Uses dual-driven (closed-set and open-set) strategies to generate region-text pairs
Applies consistency- and distribution-aware filtering
Yichi Zhang
Harbin Institute of Technology, Shenzhen
Gongwei Chen
Harbin Institute of Technology, Shenzhen
Jun Zhu
Harbin Institute of Technology, Shenzhen
Jia Wan