🤖 AI Summary
The scarcity of high-quality multimodal training data severely constrains the development of vision-language models (VLMs).
Method: This paper proposes a target-driven, closed-loop data generation framework based on dual-agent self-play. Two VLM agents, termed "Describer" and "Verifier", engage in a goal-oriented dialogue game centered on image identification; successful interactions are filtered and used to synthesize high-fidelity interleaved image-text data, which is then used for supervised fine-tuning, all without human annotation.
Contribution/Results: We introduce the first VLM self-play dialogue-game paradigm, establishing a self-improving "generate–evaluate–optimize" loop. Experiments show that iterative training significantly improves downstream task performance and cross-dataset generalization. Moreover, the quality of self-play improves in tandem with model upgrades, effectively alleviating the multimodal data bottleneck.
📝 Abstract
The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented game centered around image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalises across datasets. Moreover, as improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios, especially when high-quality multimodal data is scarce.
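The data-curation loop described above (Describer describes a target image, Verifier picks it out of a candidate set, and only successful games are kept as training data) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `describer` and `verifier` functions are hypothetical stubs standing in for real VLM agents, and the record format is assumed.

```python
import random

def describer(image):
    # A real Describer VLM would generate a natural-language
    # description of the target image; here we return a stub string.
    return f"a description of {image}"

def verifier(description, candidates):
    # A real Verifier VLM would select the candidate image that best
    # matches the description. Stub: pick the candidate whose id
    # appears in the description, falling back to a random choice.
    for img in candidates:
        if img in description:
            return img
    return random.choice(candidates)

def self_play_round(target, distractors):
    """Play one dialogue game and record whether the Verifier succeeded."""
    description = describer(target)
    candidates = [target] + distractors
    guess = verifier(description, candidates)
    return {"image": target, "dialogue": description,
            "success": guess == target}

def curate_dataset(images, n_distractors=3, rounds=100):
    """Run many games and keep only successful interactions
    as interleaved image-text fine-tuning data."""
    data = []
    for _ in range(rounds):
        target = random.choice(images)
        pool = [i for i in images if i != target]
        distractors = random.sample(pool, k=min(n_distractors, len(pool)))
        record = self_play_round(target, distractors)
        if record["success"]:  # the filtering step from the paper
            data.append(record)
    return data

images = [f"img_{i}" for i in range(10)]
dataset = curate_dataset(images)
print(len(dataset))
```

In the full framework, the curated `dataset` would then be used for supervised fine-tuning of the same VLM, closing the generate–evaluate–optimize loop so the next iteration of self-play starts from a stronger model.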