Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of training data for visual grounding and the label misalignment that arises when training images are synthesized. It proposes POBF (Paint Outside the Box and Filter), a framework that synthesizes images by inpainting only the area outside the annotated box, which is intended to keep the original box annotation aligned with the generated image. POBF also includes a filtering scheme that selects training data by combining a hardness score with an overfitting score, balanced by a penalty term. Across four benchmark datasets, POBF achieves an average gain of 5.83% over training on real data only and outperforms leading baselines by 2.29% to 3.85% in accuracy. The gains hold across different generative models, training data sizes, and model architectures.

📝 Abstract
Visual grounding aims to localize image regions based on a textual query. Given the difficulty of large-scale data curation, this paper investigates how to learn visual grounding effectively under data-scarce settings. To address the data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF uses a filtering scheme to select the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments across four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
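
To make "inpainting outside the box" concrete, the sketch below builds the kind of binary mask an inpainting model typically consumes: pixels inside the annotated box are kept, and everything else may be repainted. It is an illustration only, not the authors' implementation. The function name `outside_box_mask`, the `(x_min, y_min, x_max, y_max)` box layout with exclusive maxima, and the 1 = repaint convention are all assumptions.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels, maxima exclusive.
Box = Tuple[int, int, int, int]


def outside_box_mask(height: int, width: int, box: Box) -> List[List[int]]:
    """Return a binary mask where 1 marks pixels that may be repainted
    and 0 marks pixels inside the box, which are left untouched."""
    x0, y0, x1, y1 = box
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError("box must lie inside the image and have positive area")
    return [
        [0 if (x0 <= x < x1 and y0 <= y < y1) else 1 for x in range(width)]
        for y in range(height)
    ]


# A 4x6 image whose annotated object occupies columns 1..3 and rows 1..2.
mask = outside_box_mask(4, 6, (1, 1, 4, 3))
for row in mask:
    print(row)
```

Because the box region is never repainted, the original box coordinates and query text can be reused as the label for the synthesized image.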
Problem

Research questions and friction points this paper is trying to address.

Learning visual grounding with limited training data
Addressing label misalignment in synthesized images
Selecting the most effective training data via a filtering scheme
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes images by inpainting outside the box
Filters training data with hardness and overfitting scores, balanced by a penalty term
Improves accuracy across four benchmark datasets
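
The filtering idea can be sketched as a ranking over two per-sample scores. This summary does not give the paper's formula, so everything in the block is an assumption made for illustration: the combination `hardness - penalty_weight * overfitting`, the preference for higher combined scores, the top-k selection, and the names `select_samples`, `penalty_weight`, and `keep`.

```python
from typing import List, Sequence


def select_samples(
    hardness: Sequence[float],
    overfitting: Sequence[float],
    penalty_weight: float,
    keep: int,
) -> List[int]:
    """Return the indices of the `keep` samples with the highest combined score.

    Assumed scoring rule (not taken from the paper):
        combined = hardness - penalty_weight * overfitting
    """
    if len(hardness) != len(overfitting):
        raise ValueError("score lists must have the same length")
    combined = [h - penalty_weight * o for h, o in zip(hardness, overfitting)]
    # Rank sample indices from highest to lowest combined score.
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])


# Toy scores for four synthesized samples (made-up numbers).
hardness = [0.9, 0.2, 0.7, 0.8]
overfitting = [0.8, 0.1, 0.1, 0.3]

chosen = select_samples(hardness, overfitting, penalty_weight=1.0, keep=2)
unpenalized = select_samples(hardness, overfitting, penalty_weight=0.0, keep=2)
print(chosen, unpenalized)
```

In this toy setup the penalty changes the selection: sample 0 is the hardest but also has the highest overfitting score, so it is dropped once the penalty is applied.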
Zilin Du
Nanyang Technological University, Singapore
Haoxin Li
Nanyang Technological University
Computer Vision, Vision and Language
Jianfei Yu
Singapore Management University
Natural Language Processing, Text Mining, Machine Learning
Boyang Albert Li
Nanyang Technological University, Singapore