One-stage Modality Distillation for Incomplete Multimodal Learning

📅 2023-09-15

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal learning faces significant challenges during inference when certain modalities are missing (e.g., absent depth in RGB-D data), which undermines robustness and generalization. Method: This paper proposes a single-stage modality distillation framework that unifies privileged knowledge transfer and cross-modal information fusion. Specifically: (1) it introduces a single-stage, multi-task joint optimization paradigm to eliminate error accumulation from sequential training; (2) it designs a joint adaptation network to mitigate representational heterogeneity across modalities; and (3) it constructs a parameter-shared cross translation network to explicitly model semantic mappings between modalities. Results: Evaluated on RGB-D classification and segmentation tasks under incomplete-input settings, the method consistently outperforms existing approaches for partial-modality inference, achieving state-of-the-art performance. It significantly enhances model generalization under modality dropout scenarios while maintaining architectural efficiency and training stability.

📝 Abstract

Learning from multimodal data has attracted increasing interest recently. While a variety of sensory modalities can be collected for training, not all of them are always available in development scenarios, which raises the challenge of inference with incomplete modalities. To address this issue, this paper presents a one-stage modality distillation framework that unifies privileged knowledge transfer and modality information fusion into a single optimization procedure via multi-task learning. Compared with conventional modality distillation, which performs the two steps independently, this helps capture valuable representations that directly assist the final model's inference. Specifically, we propose a joint adaptation network for the modality transfer task to preserve the privileged information; it addresses the representation heterogeneity caused by input discrepancy via joint distribution adaptation. We then introduce a cross translation network for the modality fusion task to aggregate the restored and available modality features; it leverages a parameter-sharing strategy to capture cross-modal cues explicitly. Extensive experiments on RGB-D classification and segmentation tasks demonstrate that the proposed multimodal inheritance framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
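The "single optimization procedure" described in the abstract can be pictured as one scalar objective that sums a task term and a transfer term, so both are minimized together rather than in two separate stages. The sketch below is only illustrative: the abstract does not give the actual loss formulation, so the mean-squared-error feature matching (standing in for joint distribution adaptation), the cross-entropy task loss, and the `transfer_weight` parameter are all assumptions, and every function name is hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(class_probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(class_probs[label])


def one_stage_loss(restored_feat, privileged_feat, class_probs, label,
                   transfer_weight=0.5):
    """Combine both tasks into one objective (hypothetical formulation).

    restored_feat:   features of the missing modality, predicted from the
                     available one (e.g. depth features guessed from RGB)
    privileged_feat: features from the real missing modality, which is
                     available only at training time
    class_probs:     predicted class probabilities from the fused features
    """
    # Assumed stand-in for the paper's joint distribution adaptation.
    transfer = mse(restored_feat, privileged_feat)
    task = cross_entropy(class_probs, label)
    # One scalar, so a single optimizer step updates both tasks at once.
    return task + transfer_weight * transfer
```

When the restored features already match the privileged ones, the transfer term vanishes and only the task loss remains; any mismatch adds a penalty scaled by `transfer_weight`.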

Problem

Research questions and friction points this paper is trying to address.

Addresses incomplete multimodal learning in development scenarios

Proposes one-stage distillation for knowledge transfer and fusion

Tackles representation heterogeneity across modalities and the aggregation of cross-modal features

Innovation

Methods, ideas, or system contributions that make the work stand out.

One-stage modality distillation via multi-task learning

Joint adaptation network for privileged information preservation

Cross translation network for explicit cross-modal fusion
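The parameter-sharing idea behind the cross translation network can be illustrated with a toy example in which one set of weights projects both the available and the restored modality features before they are fused. This is a minimal sketch under stated assumptions, not the paper's architecture: the single linear projection, the elementwise-sum fusion, and the names `SharedProjection` and `fuse` are hypothetical.

```python
class SharedProjection:
    """A single linear map whose weights are reused for every modality."""

    def __init__(self, weights):
        # weights is a list of rows with shape (out_dim, in_dim)
        self.weights = weights

    def __call__(self, feat):
        # Plain matrix-vector product, one output value per weight row.
        return [sum(w * x for w, x in zip(row, feat)) for row in self.weights]


def fuse(shared, available_feat, restored_feat):
    """Project both features with the same weights, then sum elementwise.

    Summation is an assumed fusion rule chosen for brevity.
    """
    projected_available = shared(available_feat)
    projected_restored = shared(restored_feat)
    return [a + r for a, r in zip(projected_available, projected_restored)]


shared = SharedProjection([[1.0, 0.0], [0.0, 2.0]])
fused = fuse(shared, [1.0, 1.0], [2.0, 3.0])
print(fused)  # [3.0, 8.0]
```

Because both branches go through the same `weights`, any gradient from either modality would update a single parameter set, which is the sense in which sharing forces the two feature streams into a common space.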
