Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large audio-language models for speech emotion recognition, which, despite their strong performance, are hindered by excessive parameter counts and difficulties in cross-modal alignment during knowledge distillation—particularly due to feature dimension mismatches. To overcome these limitations, the authors propose PL-Distill, a novel distillation framework that introduces attention-weighted Centered Kernel Alignment (CKA) to align audio embeddings at critical time steps. The method jointly optimizes knowledge transfer from both the teacher’s projection layer (PDist) and logits layer (LDist), employing KL divergence to align multimodal outputs. Evaluated on IEMOCAP, RAVDESS, and SAVEE, PL-Distill successfully compresses an 8.4B-parameter teacher model into a 1.1B-parameter student model that consistently outperforms the original teacher, existing pretrained models, and state-of-the-art distillation baselines.
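The core alignment idea, linear Centered Kernel Alignment between teacher and student audio features with per-time-step attention weighting, can be sketched as below. This is a minimal NumPy sketch under stated assumptions: the function names and the square-root attention weighting are illustrative, not the paper's exact formulation. Note that CKA compares T×T Gram matrices, so the teacher and student feature dimensions may differ, which is precisely why it sidesteps the dimension-mismatch problem.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X: (T, d_s) and Y: (T, d_t).

    The feature dimensions d_s and d_t may differ; only the number of
    time steps T must match, since CKA works on T x T Gram matrices.
    """
    # Center each feature dimension over time steps
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), in [0, 1]
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

def attention_weighted_cka(X, Y, attn):
    """CKA with time steps weighted by attention scores attn: (T,).

    Scaling each row by sqrt of its normalized weight emphasizes that
    step's contribution to both Gram matrices (an assumed scheme).
    """
    w = np.sqrt(attn / attn.sum())[:, None]
    return linear_cka(w * X, w * Y)
```

A distillation loss would then be something like `1 - attention_weighted_cka(student_feats, teacher_feats, attn)`, so that maximizing similarity minimizes the loss.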

📝 Abstract
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, distillation of the cross-modal projection module (Projector) remains underexplored in existing methods, which often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, which highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from the audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher into a compact 1.1B-parameter student that consistently outperforms the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
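The LDist term in the abstract, matching teacher and student output distributions via KL divergence, might look like the following sketch. The temperature `T`, the teacher-to-student KL direction, and the mean reduction are assumptions common in logits distillation, not details confirmed by the abstract; the paper applies such a term over both audio- and text-modality logits.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis, numerically stable."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_logits_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between softened class distributions.

    The T*T factor keeps gradient magnitudes comparable across
    temperatures (standard practice in logits distillation).
    """
    p = softmax(teacher_logits, T)  # teacher distribution (target)
    q = softmax(student_logits, T)  # student distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)
```

The loss is zero when the two logit sets induce identical distributions and strictly positive otherwise, so minimizing it pulls the student's soft predictions toward the teacher's.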
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Large Audio-Language Models
Speech Emotion Recognition
Cross-modal Projection
Feature Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Large Audio-Language Models
Centered Kernel Alignment
Cross-modal Projection
Speech Emotion Recognition
Qingran Yang
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Botao Zhao
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Zuheng Kang
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xue Li
Harbin Institute of Technology, Harbin, China
Yayun He
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Chuhang Liu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xulong Zhang
Ping An Technology (Shenzhen) Co., Ltd.
Federated Large Models, Trusted Computing, Graph Computing
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Junqing Peng
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher, Department of Electrical and Computer Engineering, University of Florida
Big Data, Storage System, Cloud Computing