Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving cross-modal generalization while preserving modality-specific characteristics in multimodal representation learning. The authors propose CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a framework that builds a competition-free unified discrete representation space by aligning modality-specific codebooks at the index level. CoDAAR combines Discrete Temporal Alignment (DTA), for fine-grained temporal quantization, with Cascading Semantic Alignment (CSA), for progressive cross-modal semantic agreement, so that modalities reach semantic consensus while each retains its own structure. The model is trained with self-supervised reconstruction objectives on paired multimodal sequences. The authors report state-of-the-art performance on Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer.

📝 Abstract

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
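
The idea of index-level agreement between modality-specific codebooks can be illustrated with a toy sketch. This is not the paper's DTA or CSA mechanism, whose details the abstract does not give. It assumes plain nearest-codeword quantization, and all names (`quantize`, `index_agreement`, the codebooks, and the feature values) are hypothetical. Each modality keeps its own codebook with its own feature dimension, and consensus is measured only on the code indices chosen at paired time steps.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codeword.

    Assumption: nearest-neighbor lookup stands in for whatever
    quantizer the paper actually uses.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def index_agreement(indices_a: Sequence[int], indices_b: Sequence[int]) -> float:
    """Fraction of paired time steps whose code indices coincide."""
    if len(indices_a) != len(indices_b):
        raise ValueError("paired sequences must have the same length")
    if not indices_a:
        return 0.0
    matches = sum(i == j for i, j in zip(indices_a, indices_b))
    return matches / len(indices_a)


# Hypothetical codebooks: same number of entries, different feature sizes.
# Index k is meant to denote the same concept in both modalities.
audio_codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
video_codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 2.0, 0.0)]

# Hypothetical paired sequences with four time steps each.
audio_features = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1), (1.1, 0.9)]
video_features = [(0.0, 0.1, 0.0), (1.1, 0.9, 1.0), (0.1, 1.9, 0.2), (0.2, 0.1, 0.1)]

audio_indices = quantize(audio_features, audio_codebook)
video_indices = quantize(video_features, video_codebook)
agreement = index_agreement(audio_indices, video_indices)

print(audio_indices, video_indices, agreement)
```

Because the comparison happens on indices rather than on the codewords themselves, the two codebooks never need to share a vector space, which is one way to read the claim that modality-specific structure is kept while the discrete space is shared.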

Problem

Research questions and friction points this paper is trying to address.

cross-modal generalization

modality-specific structure

discrete representations

multimodal learning

domain generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete representation

cross-modal generalization

semantic alignment