Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap

📅 2024-02-06
🏛️ International Conference on Learning Representations
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses Multimodal Unsupervised Domain Generalization (MUDG): improving generalization to unseen target tasks using a large, task-agnostic, unlabeled multimodal source dataset, without target-domain annotations and without any assumed relationship between the source data and the target labels. The method introduces three key components: (1) paired k-means clustering, which raises cross-modal approximate nearest-neighbor retrieval recall by storing centroids in the text-query space rather than the image space; (2) adaptive text prompt augmentation for target labels, designed to improve zero-shot accuracy and diversify the retrieved images; and (3) two simple but effective components that further improve downstream target accuracy after fine-tuning in a joint vision-language embedding space. Evaluated across 20 diverse datasets, the approach consistently outperforms name-only transfer, source-free domain generalization, and standard zero-shot transfer baselines.

๐Ÿ“ Abstract
Domain generalization (DG) is the problem of learning a model that generalizes to unseen test domains by leveraging one or more source domains, under the assumption of shared label spaces. However, most DG methods assume access to abundant source data in the target label space, a requirement that proves overly stringent for numerous real-world applications, where acquiring the same label space as the target task is prohibitively expensive. For this setting, we tackle the multimodal version of the unsupervised domain generalization (MUDG) problem, which uses a large task-agnostic unlabeled source dataset during fine-tuning. Our framework does not explicitly assume any relationship between the source dataset and the target task. Instead, it relies only on the premise that the source dataset can be accurately and efficiently searched in a joint vision-language space. We make three contributions in the MUDG setting. First, we show theoretically that cross-modal approximate nearest neighbor search suffers from low recall due to the large distance between text queries and the image centroids used for coarse quantization. Accordingly, we propose paired k-means, a simple clustering algorithm that improves nearest neighbor recall by storing centroids in query space instead of image space. Second, we propose an adaptive text augmentation scheme for target labels designed to improve zero-shot accuracy and diversify retrieved image data. Lastly, we present two simple but effective components to further improve downstream target accuracy. We compare against state-of-the-art name-only transfer, source-free DG, and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg
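The paired k-means idea from the abstract can be sketched as follows: cluster assignments are computed in image space as usual, but each cluster also stores a centroid in the paired text (query) space, so that text queries used for coarse quantization are compared against centroids of their own modality. This is an illustrative reading of the abstract, not the authors' implementation; the function and variable names are assumptions.

```python
import numpy as np

def paired_kmeans(img_emb, txt_emb, k, n_iter=20, seed=0):
    """Sketch of paired k-means: cluster in image space, but also keep
    a centroid per cluster in the paired text (query) space."""
    rng = np.random.default_rng(seed)
    img_centroids = img_emb[rng.choice(len(img_emb), k, replace=False)]
    for _ in range(n_iter):
        # assign each image to its nearest image-space centroid
        d = ((img_emb[:, None, :] - img_centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(k):
            members = assign == c
            if members.any():
                img_centroids[c] = img_emb[members].mean(0)
    # query-space centroids: mean of the paired text embeddings per cluster
    # (empty clusters fall back to the image centroid -- a sketch-only choice)
    txt_centroids = np.stack([
        txt_emb[assign == c].mean(0) if (assign == c).any() else img_centroids[c]
        for c in range(k)
    ])
    return assign, img_centroids, txt_centroids
```

At search time, a text query would be routed by its nearest `txt_centroids` entry rather than by the image-space centroids, which is the mechanism the abstract credits for the recall improvement.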
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal unsupervised domain generalization with unlabeled data
Improves cross-modal retrieval recall via paired k-means clustering
Enhances zero-shot accuracy using adaptive text augmentation
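The retrieval-recall friction point above can be made concrete with a toy IVF-style search: a query first picks its `n_probe` nearest coarse centroids and then searches only those cells, so recall drops whenever the true nearest neighbor was quantized into a cell the query does not probe, which is exactly what happens when text queries sit far from image-space centroids. A minimal sketch (names are placeholders, not from the paper):

```python
import numpy as np

def coarse_recall(queries, db, centroids, n_probe=1):
    """Recall@1 of IVF-style coarse-quantized search: each query probes
    its n_probe nearest centroids and searches only those cells."""
    # which coarse cell each database vector falls into
    assign = ((db[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    # exhaustive ground-truth nearest neighbor of each query
    true_nn = ((queries[:, None] - db[None]) ** 2).sum(-1).argmin(1)
    hits = 0
    for q in range(len(queries)):
        cells = ((queries[q] - centroids) ** 2).sum(-1).argsort()[:n_probe]
        hits += assign[true_nn[q]] in cells  # did we probe the right cell?
    return hits / len(queries)
```

Placing the centroids in query space instead of database (image) space, as paired k-means does, is one way to keep this recall high when a modality gap separates the two embedding clouds.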
Innovation

Methods, ideas, or system contributions that make the work stand out.

Paired k-means improves cross-modal search recall
Adaptive text augmentation enhances zero-shot accuracy
Two simple but effective components further boost downstream target accuracy
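For context on the text-augmentation bullet: the standard baseline such schemes adapt is CLIP-style prompt ensembling, where each class name is rendered through several templates and the normalized prompt embeddings are averaged into one zero-shot classifier weight. A minimal sketch, assuming a generic `encode_text` embedding function and an illustrative template list; the paper's adaptive scheme chooses augmentations per target label, which this sketch does not do:

```python
import numpy as np

# illustrative templates only -- not the paper's augmentation set
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
]

def class_weights(labels, encode_text):
    """Average the L2-normalized embeddings of several prompts per label
    into one zero-shot classifier weight per class."""
    W = []
    for name in labels:
        embs = np.stack([encode_text(t.format(name)) for t in TEMPLATES])
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)
        W.append(embs.mean(0))
    W = np.stack(W)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def zero_shot_predict(img_emb, W):
    """Predict the class whose prompt-ensemble weight has the highest
    cosine similarity with each image embedding."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    return (img @ W.T).argmax(1)
```

The same class weights can double as retrieval queries, which is why diversifying the prompts also diversifies the images retrieved for fine-tuning.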
🔎 Similar Papers
No similar papers found.