VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

📅 2024-07-29

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

career value

189K/year

🤖 AI Summary

Vision-language models exhibit limited generalization across domains, primarily due to the absence of standardized domain generalization (DG) benchmarks. To address this gap, we introduce VolDoGer—the first DG benchmark for vision-language tasks, covering image captioning, visual question answering (VQA), and visual entailment. VolDoGer comprises both multi-source real-world domains and controllably shifted synthetic domains. We innovatively extend large language model (LLM)-driven data synthesis to multimodal DG, integrating cross-domain prompt engineering, multi-stage vision-language alignment annotation, and controllable domain shift construction to achieve high-quality, low-cost, multi-domain labeling. Comprehensive evaluation of 12 state-of-the-art models on VolDoGer reveals critical cross-domain performance bottlenecks. The benchmark is publicly released to foster standardized, reproducible research in multimodal domain generalization.

Technology Category

Application Category

📝 Abstract

Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.

Problem

Research questions and friction points this paper is trying to address.

Lack of datasets for vision-language domain generalization

Need for domain generalizability in vision-language tasks

Challenges in human annotation for vision-language data

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based data annotation for vision-language tasks

Dedicated dataset for domain generalization evaluation

Extends annotation to image captioning, VQA, entailment

🔎 Similar Papers

No similar papers found.