Decoupled Hierarchical Distillation for Multimodal Emotion Recognition

📅 2026-02-03
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses modality heterogeneity and the imbalanced contributions of different modalities in multimodal emotion recognition by proposing Decoupled Hierarchical Multimodal Distillation (DHMD). The framework uses a self-regression mechanism to decompose each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components. It then applies two-stage knowledge distillation: coarse-grained distillation via a Graph Distillation Unit (GD-Unit), whose dynamic graph adaptively weights knowledge transfer among modalities, followed by fine-grained distillation via cross-modal dictionary matching, which aligns semantic granularities across modalities to yield more discriminative representations. On the CMU-MOSI and CMU-MOSEI datasets, DHMD consistently improves over state-of-the-art methods, with relative gains of 1.3% and 2.4% in 7-class accuracy (ACC₇), 1.3% and 1.9% in binary accuracy (ACC₂), and 1.9% and 1.8% in F1 score, respectively.

📝 Abstract
Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3%/2.4% (ACC$_{7}$), 1.3%/1.9% (ACC$_{2}$) and 1.9%/1.8% (F1) relative improvements on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.
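To make the pipeline concrete, here is a toy sketch of the two ideas the abstract describes: decomposing each modality's features into a shared (modality-irrelevant) part plus a residual (modality-exclusive) part, and then applying a graph-weighted coarse-grained distillation loss among modalities. This is a minimal illustration under stated assumptions, not the paper's implementation: the least-squares projection used for decoupling, the similarity-softmax adjacency, the feature dimensions, and all variable names are assumptions.

```python
# Illustrative sketch only. The subspace projection standing in for the paper's
# self-regression decoupling, and the similarity-softmax graph standing in for
# the learned GD-Unit adjacency, are simplifications for exposition.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # per-modality feature dimension (assumed)
feats = {m: rng.normal(size=d) for m in ("language", "visual", "acoustic")}

# Decoupling: project each modality's feature onto a common subspace to get a
# homogeneous component; the orthogonal residual plays the heterogeneous role.
basis, _ = np.linalg.qr(rng.normal(size=(d, 3)))  # toy 3-dim shared subspace
homo = {m: basis @ (basis.T @ x) for m, x in feats.items()}
hetero = {m: x - homo[m] for m, x in feats.items()}

# Coarse-grained graph distillation: a dynamic adjacency decides how strongly
# each modality distils from the others (row-softmax over feature similarity).
mods = list(feats)
sim = np.array([[homo[a] @ homo[b] for b in mods] for a in mods])
np.fill_diagonal(sim, -np.inf)  # exclude self-distillation edges
adj = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Distillation loss: adjacency-weighted squared distances between the
# homogeneous features of every ordered modality pair.
loss = sum(adj[i, j] * np.sum((homo[a] - homo[b]) ** 2)
           for i, a in enumerate(mods) for j, b in enumerate(mods) if i != j)
print(round(loss, 4))
```

In DHMD this coarse stage is followed by fine-grained distillation through cross-modal dictionary matching; the sketch stops at the coarse stage because the dictionary construction is not specified in the abstract.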
Problem

Research questions and friction points this paper is trying to address.

multimodal emotion recognition
modality heterogeneity
cross-modal alignment
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Representation
Hierarchical Knowledge Distillation
Multimodal Emotion Recognition
Graph Distillation Unit
Cross-modal Dictionary Matching