🤖 AI Summary
This paper introduces Open-Set Cross-Modal Generalization (OSCMG), a novel task that evaluates how robustly multimodal models generalize to *new modalities* and *unseen categories* in open-world settings, thereby overcoming the limitations of conventional closed-set cross-modal evaluation. To tackle this challenge, we propose MICU, a unified multimodal representation framework integrating (i) fine- and coarse-grained masked contrastive learning and (ii) cross-modal jigsaw self-supervision built on modality-agnostic feature selection. Our method jointly optimizes a masked multimodal InfoNCE loss and a jigsaw permutation-prediction objective to enhance both semantic consistency and feature diversity. Extensive experiments on both the standard Cross-Modal Generalization (CMG) benchmark and the newly proposed OSCMG task demonstrate significant improvements in cross-modal transferability and unknown-category recognition accuracy. To our knowledge, this is the first work to systematically advance multimodal unified representation learning toward open-set scenarios, establishing a foundation for scalable, adaptive multimodal generalization.
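The masked multimodal InfoNCE objective summarized above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name, the feature-dimension masking scheme, and the temperature value are all hypothetical choices for exposition.

```python
import numpy as np

def masked_infonce(za, zb, mask_ratio=0.3, tau=0.07, rng=None):
    """Toy masked InfoNCE between paired modality embeddings.

    za, zb: (batch, dim) arrays of paired features from two modalities
    (e.g., audio/video); row i of za is the positive for row i of zb.
    A random fraction of feature dimensions is zeroed (the "masking").
    """
    rng = np.random.default_rng(rng)
    # Randomly mask a fraction of feature dimensions in each view
    za = za * (rng.random(za.shape) > mask_ratio)
    zb = zb * (rng.random(zb.shape) > mask_ratio)
    # L2-normalize so dot products are cosine similarities
    za = za / (np.linalg.norm(za, axis=1, keepdims=True) + 1e-8)
    zb = zb / (np.linalg.norm(zb, axis=1, keepdims=True) + 1e-8)
    logits = za @ zb.T / tau                       # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_softmax))
```

In MICU this style of loss is applied at both the holistic (coarse) and temporal (fine) level; the sketch shows only a single level for clarity.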
📝 Abstract
This paper extends Cross-Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-Set Cross-Modal Generalization (OSCMG) task. OSCMG evaluates multimodal unified representations under open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. It requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing work on multimodal unified representations does not account for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross-modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both the holistic semantic and the temporal level, incorporating masking to improve generalization. CUJP increases feature diversity and model uncertainty by combining modality-agnostic feature selection with self-supervised learning, thereby strengthening the model's ability to handle unknown categories in open-set tasks. Extensive experiments on both CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at https://github.com/haihuangcode/CMG.
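The jigsaw idea behind CUJP can be illustrated with a small sketch: cut a feature sequence into pieces, shuffle them, and use the permutation itself as a self-supervised label for a classifier to predict. The helper below is a hypothetical NumPy example (its name, signature, and shapes are assumptions, not the paper's code); because it operates on generic feature arrays, it is modality-agnostic.

```python
import numpy as np

def make_jigsaw_example(features, n_pieces=4, rng=None):
    """Create one jigsaw self-supervision example.

    features: (T, dim) feature sequence from any modality.
    Returns (shuffled_sequence, permutation); the permutation serves
    as the prediction target for a jigsaw classifier.
    """
    rng = np.random.default_rng(rng)
    T = features.shape[0]
    assert T % n_pieces == 0, "sequence length must divide evenly into pieces"
    # Cut into contiguous pieces: (n_pieces, T // n_pieces, dim)
    pieces = features.reshape(n_pieces, T // n_pieces, -1)
    perm = rng.permutation(n_pieces)
    # Reassemble the shuffled pieces back into a (T, dim) sequence
    return pieces[perm].reshape(T, -1), perm
```

Applying the inverse permutation (`np.argsort(perm)`) to the shuffled pieces recovers the original sequence, which is what makes the permutation a well-defined learning target.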