Decoupled Hierarchical Distillation for Multimodal Emotion Recognition

📅 2026-02-03
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses modality heterogeneity and the imbalanced contributions of different modalities in multimodal emotion recognition by proposing Decoupled Hierarchical Multimodal Distillation (DHMD). The framework uses a self-regression mechanism to decompose each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components. It then applies two-stage knowledge distillation: coarse-grained distillation via a Graph Distillation Unit (GD-Unit), whose dynamic graph adaptively weights knowledge transfer among modalities, followed by fine-grained distillation via cross-modal dictionary matching, which aligns semantic granularities across modalities to yield more discriminative representations. On the CMU-MOSI and CMU-MOSEI datasets, DHMD consistently improves over state-of-the-art methods, with relative gains of 1.3% and 2.4% in 7-class accuracy (ACC₇), 1.3% and 1.9% in binary accuracy (ACC₂), and 1.9% and 1.8% in F1 score, respectively.

📝 Abstract
Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3%/2.4% (ACC$_{7}$), 1.3%/1.9% (ACC$_{2}$) and 1.9%/1.8% (F1) relative improvements on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.
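To make the pipeline concrete, here is a toy sketch of the two ideas the abstract describes: decomposing each modality's features into a shared (modality-irrelevant) part plus a residual (modality-exclusive) part, and then applying a graph-weighted coarse-grained distillation loss among modalities. This is a minimal illustration under stated assumptions, not the paper's implementation: the least-squares projection used for decoupling, the similarity-softmax adjacency, the feature dimensions, and all variable names are assumptions.

```python
# Illustrative sketch only. The subspace projection standing in for the paper's
# self-regression decoupling, and the similarity-softmax graph standing in for
# the learned GD-Unit adjacency, are simplifications for exposition.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # per-modality feature dimension (assumed)
feats = {m: rng.normal(size=d) for m in ("language", "visual", "acoustic")}

# Decoupling: project each modality's feature onto a common subspace to get a
# homogeneous component; the orthogonal residual plays the heterogeneous role.
basis, _ = np.linalg.qr(rng.normal(size=(d, 3)))  # toy 3-dim shared subspace
homo = {m: basis @ (basis.T @ x) for m, x in feats.items()}
hetero = {m: x - homo[m] for m, x in feats.items()}

# Coarse-grained graph distillation: a dynamic adjacency decides how strongly
# each modality distils from the others (row-softmax over feature similarity).
mods = list(feats)
sim = np.array([[homo[a] @ homo[b] for b in mods] for a in mods])
np.fill_diagonal(sim, -np.inf)  # exclude self-distillation edges
adj = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Distillation loss: adjacency-weighted squared distances between the
# homogeneous features of every ordered modality pair.
loss = sum(adj[i, j] * np.sum((homo[a] - homo[b]) ** 2)
           for i, a in enumerate(mods) for j, b in enumerate(mods) if i != j)
print(round(loss, 4))
```

In DHMD this coarse stage is followed by fine-grained distillation through cross-modal dictionary matching; the sketch stops at the coarse stage because the dictionary construction is not specified in the abstract.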
Problem

Research questions and friction points this paper is trying to address.

multimodal emotion recognition
modality heterogeneity
cross-modal alignment
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Representation
Hierarchical Knowledge Distillation
Multimodal Emotion Recognition
Graph Distillation Unit
Cross-modal Dictionary Matching