🤖 AI Summary
This work addresses the scarcity of annotated text-region pairs for visual grounding. The proposed framework, POBF, synthesizes training images by inpainting *outside* the annotated box with a diffusion model, so each generated image remains consistent with its original box and query; this sidesteps the label misalignment that arises when prior methods regenerate the annotated region itself. POBF then filters the synthetic data with a selection scheme that combines a per-sample hardness score and an overfitting score, balanced by a penalty term, to curate the most effective training subset. Across four standard benchmarks, POBF improves accuracy by an average of 5.83% over training on real annotations alone and outperforms state-of-the-art baselines by 2.29%-3.85%. It also proves robust across different generative models, annotation scales, and vision-language architectures.
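To make the inpainting step concrete, here is a minimal sketch using an off-the-shelf diffusion inpainting pipeline. It assumes the Hugging Face diffusers library and a Stable Diffusion inpainting checkpoint; the helper `out_of_box_mask`, the example file name, box coordinates, and prompt are all illustrative, not taken from the paper.

```python
# Minimal sketch of out-of-box inpainting (illustrative, not the paper's code).
# Assumes the Hugging Face diffusers library and a Stable Diffusion inpainting
# checkpoint; any diffusion inpainting model could be substituted.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def out_of_box_mask(image_size, box):
    """Build a mask that repaints everything EXCEPT the annotated box.

    In diffusers' inpainting convention, white (255) pixels are regenerated
    and black (0) pixels are kept, so the grounded region survives intact
    and the existing box/query annotation remains valid.
    """
    mask = Image.new("L", image_size, 255)           # repaint by default
    ImageDraw.Draw(mask).rectangle(box, fill=0)      # preserve the box region
    return mask

image = Image.open("example.jpg").convert("RGB")     # a real annotated image
box = (120, 80, 340, 300)                            # (x0, y0, x1, y1) annotation
mask = out_of_box_mask(image.size, box)
synthetic = pipe(
    prompt="a new, plausible background scene",      # hypothetical prompt
    image=image,
    mask_image=mask,
).images[0]
```

Because only the region outside the box is regenerated, the synthetic image inherits the real box and query unchanged, which is how out-of-box inpainting avoids annotation misalignment.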
📝 Abstract
Visual grounding aims to localize image regions based on a textual query. Given the difficulty of large-scale data curation, this paper investigates how to learn visual grounding effectively in data-scarce settings. To address the data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, tackling the label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to select the most effective training data; this scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments on four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29%-3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
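The abstract does not specify how the hardness score, overfitting score, and penalty term are combined, so the sketch below assumes a simple linear ranking score purely for illustration; the function name, the weight `lam`, and the keep ratio are hypothetical, and the paper's actual scheme may differ.

```python
import numpy as np

def select_training_subset(hardness, overfitting, penalty, lam=1.0, keep_ratio=0.5):
    """Rank synthetic samples by a combined score and keep the top fraction.

    hardness, overfitting, penalty: 1-D arrays of per-sample scores.
    The linear combination below, with the penalty weighted by `lam`,
    is an assumed form for illustration only; the paper's exact scheme
    may combine the three terms differently.
    """
    score = hardness + overfitting - lam * penalty
    k = max(1, int(len(score) * keep_ratio))
    return np.argsort(-score)[:k]        # indices of the highest-scoring samples

# Hypothetical usage with random scores for 1,000 generated samples.
rng = np.random.default_rng(0)
keep = select_training_subset(rng.random(1000), rng.random(1000), rng.random(1000))
```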