VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

📅 2024-07-29
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Research on the domain generalization (DG) of vision-language models remains limited, primarily because standardized DG benchmarks are lacking. To address this gap, we introduce VolDoGer, the first DG benchmark for vision-language tasks, covering image captioning, visual question answering (VQA), and visual entailment. VolDoGer comprises both multi-source real-world domains and controllably shifted synthetic domains. We extend large language model (LLM)-driven data synthesis to multimodal DG, combining cross-domain prompt engineering, multi-stage vision-language alignment annotation, and controllable domain shift construction to achieve high-quality, low-cost, multi-domain labeling. Comprehensive evaluation of 12 state-of-the-art models on VolDoGer reveals critical cross-domain performance bottlenecks. The benchmark is publicly released to foster standardized, reproducible research in multimodal domain generalization.

πŸ“ Abstract
Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.
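
As a rough illustration of what measuring domain generalizability can look like, the sketch below computes per-domain accuracy for a visual entailment model and the gap between the training (source) domain and the unseen domains. It is not taken from the paper: the function names, the domain names ("photo", "cartoon", "sketch"), and the toy predictions are all hypothetical, and accuracy is assumed as the metric.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute accuracy per domain from (domain, prediction, label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, prediction, label in records:
        total[domain] += 1
        correct[domain] += int(prediction == label)
    return {domain: correct[domain] / total[domain] for domain in total}


def generalization_gap(per_domain_acc, source_domain):
    """Source-domain accuracy minus the mean accuracy over all other domains."""
    unseen = [acc for d, acc in per_domain_acc.items() if d != source_domain]
    if not unseen:
        raise ValueError("need at least one unseen domain")
    return per_domain_acc[source_domain] - sum(unseen) / len(unseen)


# Toy visual entailment predictions; domains and values are hypothetical.
records = [
    ("photo", "entailment", "entailment"),
    ("photo", "neutral", "neutral"),
    ("cartoon", "entailment", "neutral"),
    ("cartoon", "neutral", "neutral"),
    ("sketch", "contradiction", "entailment"),
    ("sketch", "neutral", "entailment"),
]

acc = accuracy_by_domain(records)
gap = generalization_gap(acc, source_domain="photo")
print(acc)
print(f"generalization gap: {gap:.2f}")
```

A larger gap means the model loses more accuracy when moving from the domain it was trained on to domains it has not seen. For captioning or VQA, the exact-match comparison would be swapped for a task-appropriate metric.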
Problem

Research questions and friction points this paper is trying to address.

Lack of datasets for vision-language domain generalization
Need for domain generalizability in vision-language tasks
Challenges in human annotation for vision-language data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based data annotation for vision-language tasks
Dedicated dataset for domain generalization evaluation
Extends LLM-based annotation to image captioning, VQA, and visual entailment

Authors

Juhwan Choi
AITRICS
Deep Learning, Natural Language Processing
Junehyoung Kwon
Chung-Ang University, Seoul, Republic of Korea
Jungmin Yun
Chung-Ang University, Seoul, Republic of Korea
Seunguk Yu
Chung-Ang University, Seoul, Republic of Korea
Youngbin Kim
Senior Researcher, ETRI (Electronics and Telecommunications Research Institute)