🤖 AI Summary
The scarcity of high-quality multimodal training data severely constrains the development of vision-language models (VLMs).
Method: This paper proposes a target-driven, closed-loop data generation framework based on dual-agent self-play. Two VLM agents, termed "Describer" and "Verifier", engage in a goal-oriented dialogue game centered on image identification; successful interactions are filtered and used to synthesize high-fidelity interleaved image-text data, which is then used for supervised fine-tuning, all without human annotation.
Contribution/Results: We introduce the first VLM self-play dialogue-game paradigm, establishing a self-improving "generate–evaluate–optimize" loop. Experiments show that iterative training significantly improves downstream task performance and cross-dataset generalization. Moreover, the quality of self-play improves in tandem with model upgrades, effectively alleviating the multimodal data bottleneck.
📝 Abstract
The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented game centered around image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalises across datasets. Moreover, as improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios, especially when high-quality multimodal data is scarce.
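The data-curation loop described above (Describer describes a target image, Verifier picks it out of a candidate set, and only successful games are kept as training data) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `describer` and `verifier` functions are hypothetical stubs standing in for real VLM agents, and the record format is assumed.

```python
import random

def describer(image):
    # A real Describer VLM would generate a natural-language
    # description of the target image; here we return a stub string.
    return f"a description of {image}"

def verifier(description, candidates):
    # A real Verifier VLM would select the candidate image that best
    # matches the description. Stub: pick the candidate whose id
    # appears in the description, falling back to a random choice.
    for img in candidates:
        if img in description:
            return img
    return random.choice(candidates)

def self_play_round(target, distractors):
    """Play one dialogue game and record whether the Verifier succeeded."""
    description = describer(target)
    candidates = [target] + distractors
    guess = verifier(description, candidates)
    return {"image": target, "dialogue": description,
            "success": guess == target}

def curate_dataset(images, n_distractors=3, rounds=100):
    """Run many games and keep only successful interactions
    as interleaved image-text fine-tuning data."""
    data = []
    for _ in range(rounds):
        target = random.choice(images)
        pool = [i for i in images if i != target]
        distractors = random.sample(pool, k=min(n_distractors, len(pool)))
        record = self_play_round(target, distractors)
        if record["success"]:  # the filtering step from the paper
            data.append(record)
    return data

images = [f"img_{i}" for i in range(10)]
dataset = curate_dataset(images)
print(len(dataset))
```

In the full framework, the curated `dataset` would then be used for supervised fine-tuning of the same VLM, closing the generate–evaluate–optimize loop so the next iteration of self-play starts from a stronger model.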