AI Summary
Research on the domain generalization (DG) of vision-language models remains limited, primarily because standardized DG benchmarks are lacking. To address this gap, we introduce VolDoGer, the first DG benchmark for vision-language tasks, covering image captioning, visual question answering (VQA), and visual entailment. VolDoGer comprises both multi-source real-world domains and controllably shifted synthetic domains. We extend large language model (LLM)-driven data synthesis to multimodal DG, combining cross-domain prompt engineering, multi-stage vision-language alignment annotation, and controllable domain shift construction to produce high-quality, low-cost labels across multiple domains. An evaluation of 12 state-of-the-art models on VolDoGer reveals critical cross-domain performance bottlenecks. The benchmark is publicly released to foster standardized, reproducible research in multimodal domain generalization.
Abstract
Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.