🤖 AI Summary
This work addresses the scarcity of annotated text-region pairs for visual grounding. The proposed framework, POBF, synthesizes training images by inpainting *outside* the annotated box with a diffusion model, so each generated image remains consistent with its original box and query; this sidesteps the label misalignment that arises when prior methods regenerate the annotated region itself. POBF then filters the synthetic data with a selection scheme that combines a per-sample hardness score and an overfitting score, balanced by a penalty term, to curate the most effective training subset. Across four standard benchmarks, POBF improves accuracy by an average of 5.83% over training on real annotations alone and outperforms state-of-the-art baselines by 2.29%-3.85%. It also proves robust across different generative models, annotation scales, and vision-language architectures.
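To make the inpainting step concrete, here is a minimal sketch using an off-the-shelf diffusion inpainting pipeline. It assumes the Hugging Face diffusers library and a Stable Diffusion inpainting checkpoint; the helper `out_of_box_mask`, the example file name, box coordinates, and prompt are all illustrative, not taken from the paper.

```python
# Minimal sketch of out-of-box inpainting (illustrative, not the paper's code).
# Assumes the Hugging Face diffusers library and a Stable Diffusion inpainting
# checkpoint; any diffusion inpainting model could be substituted.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def out_of_box_mask(image_size, box):
    """Build a mask that repaints everything EXCEPT the annotated box.

    In diffusers' inpainting convention, white (255) pixels are regenerated
    and black (0) pixels are kept, so the grounded region survives intact
    and the existing box/query annotation remains valid.
    """
    mask = Image.new("L", image_size, 255)           # repaint by default
    ImageDraw.Draw(mask).rectangle(box, fill=0)      # preserve the box region
    return mask

image = Image.open("example.jpg").convert("RGB")     # a real annotated image
box = (120, 80, 340, 300)                            # (x0, y0, x1, y1) annotation
mask = out_of_box_mask(image.size, box)
synthetic = pipe(
    prompt="a new, plausible background scene",      # hypothetical prompt
    image=image,
    mask_image=mask,
).images[0]
```

Because only the region outside the box is regenerated, the synthetic image inherits the real box and query unchanged, which is how out-of-box inpainting avoids annotation misalignment.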
📝 Abstract
Visual grounding aims to localize image regions based on a textual query. Given the difficulty of large-scale data curation, this paper investigates how to learn visual grounding effectively in data-scarce settings. To address the data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, tackling the label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to select the most effective training data; this scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments on four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29%-3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
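The abstract does not specify how the hardness score, overfitting score, and penalty term are combined, so the sketch below assumes a simple linear ranking score purely for illustration; the function name, the weight `lam`, and the keep ratio are hypothetical, and the paper's actual scheme may differ.

```python
import numpy as np

def select_training_subset(hardness, overfitting, penalty, lam=1.0, keep_ratio=0.5):
    """Rank synthetic samples by a combined score and keep the top fraction.

    hardness, overfitting, penalty: 1-D arrays of per-sample scores.
    The linear combination below, with the penalty weighted by `lam`,
    is an assumed form for illustration only; the paper's exact scheme
    may combine the three terms differently.
    """
    score = hardness + overfitting - lam * penalty
    k = max(1, int(len(score) * keep_ratio))
    return np.argsort(-score)[:k]        # indices of the highest-scoring samples

# Hypothetical usage with random scores for 1,000 generated samples.
rng = np.random.default_rng(0)
keep = select_training_subset(rng.random(1000), rng.random(1000), rng.random(1000))
```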