🤖 AI Summary
Visual entailment (VE) research is constrained by the scarcity of labeled data and the high cost of manual annotation. Method: This paper proposes a large-scale synthetic data generation framework for VE built on generative AI: it pairs the textual entailment dataset SNLI with Stable Diffusion to automatically synthesize image–text pairs, replacing each textual premise with a generated image, then trains a lightweight VE classifier on CLIP features, evaluated both in-domain (SNLI-VE) and out-of-domain (SICK-VTE). Contribution/Results: Models trained solely on synthetic data reach F-scores of 0.686 on SNLI-VE (vs. 0.703 with real data) and 0.384 on SICK-VTE (vs. 0.400), approaching the performance of supervision with real data. This work empirically validates generative synthetic data for visual semantic reasoning and establishes a scalable, low-cost paradigm for data-sparse VE settings.
📝 Abstract
In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment, and creating datasets manually is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment: we use the premise text from SNLI as input prompts to a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we assess the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data leads to only a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to the original training data on another dataset, SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
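The extrinsic evaluation pipeline described above can be sketched as a lightweight classifier over CLIP feature vectors of (generated image, hypothesis text) pairs. The sketch below is illustrative only, not the authors' implementation: random vectors stand in for real CLIP embeddings (so it runs self-contained, with no model downloads), the 64-dimensional feature size is an arbitrary assumption, and the classifier is a simple softmax layer trained by gradient descent over the three VE labels (entailment / neutral / contradiction).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP features: in the paper, each premise image (generated by
# Stable Diffusion from an SNLI premise) and each hypothesis sentence would be
# encoded with CLIP. Here random vectors substitute for real embeddings.
n, d = 300, 64  # n pairs, d-dimensional features (sizes are assumptions)
img_feats = rng.normal(size=(n, d))
txt_feats = rng.normal(size=(n, d))
X = np.concatenate([img_feats, txt_feats], axis=1)  # (n, 2d) pair representation
y = rng.integers(0, 3, size=n)  # 0=entailment, 1=neutral, 2=contradiction

# Lightweight VE classifier: a single softmax layer trained with gradient descent.
W = np.zeros((2 * d, 3))
b = np.zeros(3)
Y = np.eye(3)[y]  # one-hot labels
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax probabilities
    grad = (p - Y) / n                          # cross-entropy gradient
    W -= 0.5 * (X.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

# Training accuracy on the synthetic stand-in data (above chance of 1/3).
acc = (np.argmax(X @ W + b, axis=1) == y).mean()
```

In the actual setup, `img_feats` and `txt_feats` would come from a CLIP image and text encoder respectively, and evaluation would be run on held-out SNLI-VE and SICK-VTE pairs rather than on the training set.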