Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

📅 2024-10-14

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the mismatch between real-world multimodal data and the unimodal assumption underlying conventional few-shot learning. It proposes a new task, Cross-Modal Few-Shot Learning (CFSL): recognizing and transferring concepts across modalities from only a few labeled cross-modal samples. To this end, the authors introduce the Generative Transfer Learning (GTL) framework, presented as the first of its kind, which jointly models the semantics shared across modalities and modality-specific perturbations; crucially, the generative module is frozen during transfer to keep the learned representations stable. GTL combines generative modeling, cross-modal representation learning, and joint estimation of latent variables across two stages. Experiments on four benchmarks (Sketchy, TU-Berlin, Mask1K, and SKSF-A) show improvements over state-of-the-art methods, supporting GTL’s ability to generalize from very limited samples. The work positions CFSL as a step toward few-shot learning in realistic multimodal scenarios.

📝 Abstract

Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.
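
The two-stage schedule described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the scalar "encoder" and "generator" parameters, the squared-error loss, the synthetic data, and the helper names `grads` and `train` are hypothetical stand-ins, not the paper's architecture or objective. The only point is the mechanic of training all parameters on abundant data, then freezing the generator and adapting the rest on a handful of samples.

```python
def grads(params, x, y):
    """Gradients of the squared error (g * e * x - y)**2 w.r.t. e and g."""
    e, g = params["encoder"], params["generator"]
    err = g * e * x - y
    return {"encoder": 2 * err * g * x, "generator": 2 * err * e * x}


def train(params, data, trainable, lr=0.05, epochs=200):
    """Plain SGD; only parameters named in `trainable` are updated."""
    params = dict(params)  # leave the caller's parameters untouched
    for _ in range(epochs):
        for x, y in data:
            step = grads(params, x, y)
            for name in trainable:
                params[name] -= lr * step[name]
    return params


# Stage 1: abundant data from one base modality (toy relation y = 2x).
base_data = [(i / 20, 2 * i / 20) for i in range(1, 21)]
init = {"encoder": 0.5, "generator": 0.5}
stage1 = train(init, base_data, trainable=("encoder", "generator"))

# Stage 2: five samples from a new modality (toy relation y = 4x).
# The generator is frozen, so only the encoder adapts.
novel_data = [(i / 10, 4 * i / 10) for i in range(1, 6)]
stage2 = train(stage1, novel_data, trainable=("encoder",))

print("generator unchanged:", stage2["generator"] == stage1["generator"])
print("stage-2 product g*e:", round(stage2["generator"] * stage2["encoder"], 4))
```

Passing the list of trainable parameter names explicitly keeps the frozen component out of the update loop entirely, which is one simple way to guarantee it cannot drift on the small transfer set.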

Problem

Research questions and friction points this paper is trying to address.

Addresses few-shot learning in multi-modal settings.

Proposes Cross-modal Few-Shot Learning (CFSL) task.

Introduces Generative Transfer Learning (GTL) framework.
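
To make the few-shot setting concrete, here is a minimal sketch of how a cross-modal episode might be drawn. It assumes the N-way K-shot episode convention common in few-shot learning; the source does not state the paper's evaluation protocol. The function `sample_cross_modal_episode`, the record layout, and the modality and class names are hypothetical placeholders.

```python
import random
from collections import defaultdict


def sample_cross_modal_episode(records, n_way, k_shot, n_query, seed=0):
    """Draw a support and a query set from (modality, label, sample_id) records.

    Samples of a class are pooled across modalities, so support and query
    items may come from different modalities.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    # Keep only classes with enough samples for both sets.
    eligible = sorted(
        label for label, recs in by_label.items()
        if len(recs) >= k_shot + n_query
    )
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for label in classes:
        picked = rng.sample(by_label[label], k_shot + n_query)
        support.extend(picked[:k_shot])   # the few labeled examples
        query.extend(picked[k_shot:])     # held out for evaluation
    return support, query


# Hypothetical toy catalogue: 3 classes, 2 modalities, 6 samples each.
records = [
    (modality, label, f"{modality}-{label}-{i}")
    for modality in ("photo", "sketch")
    for label in ("cat", "dog", "car")
    for i in range(6)
]
support, query = sample_cross_modal_episode(
    records, n_way=2, k_shot=1, n_query=3, seed=0
)
print(len(support), "support samples;", len(query), "query samples")
```

With `k_shot=1` each class contributes a single labeled example, which is the regime where adapting to an unfamiliar modality is hardest.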

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Transfer Learning for cross-modal tasks

Simulates human concept abstraction and generalization

Transfers knowledge from unimodal to multimodal data

🔎 Similar Papers

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

📅 2024-06-25

🏛️ arXiv.org

📈 Citations: 9
