Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large audio-language models for speech emotion recognition, which, despite their strong performance, are hindered by excessive parameter counts and difficulties in cross-modal alignment during knowledge distillation—particularly due to feature dimension mismatches. To overcome these limitations, the authors propose PL-Distill, a novel distillation framework that introduces attention-weighted Centered Kernel Alignment (CKA) to align audio embeddings at critical time steps. The method jointly optimizes knowledge transfer from both the teacher’s projection layer (PDist) and logits layer (LDist), employing KL divergence to align multimodal outputs. Evaluated on IEMOCAP, RAVDESS, and SAVEE, PL-Distill successfully compresses an 8.4B-parameter teacher model into a 1.1B-parameter student model that consistently outperforms the original teacher, existing pretrained models, and state-of-the-art distillation baselines.
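The core alignment idea, linear Centered Kernel Alignment between teacher and student audio features with per-time-step attention weighting, can be sketched as below. This is a minimal NumPy sketch under stated assumptions: the function names and the square-root attention weighting are illustrative, not the paper's exact formulation. Note that CKA compares T×T Gram matrices, so the teacher and student feature dimensions may differ, which is precisely why it sidesteps the dimension-mismatch problem.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X: (T, d_s) and Y: (T, d_t).

    The feature dimensions d_s and d_t may differ; only the number of
    time steps T must match, since CKA works on T x T Gram matrices.
    """
    # Center each feature dimension over time steps
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), in [0, 1]
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

def attention_weighted_cka(X, Y, attn):
    """CKA with time steps weighted by attention scores attn: (T,).

    Scaling each row by sqrt of its normalized weight emphasizes that
    step's contribution to both Gram matrices (an assumed scheme).
    """
    w = np.sqrt(attn / attn.sum())[:, None]
    return linear_cka(w * X, w * Y)
```

A distillation loss would then be something like `1 - attention_weighted_cka(student_feats, teacher_feats, attn)`, so that maximizing similarity minimizes the loss.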

📝 Abstract
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, distillation of the cross-modal projection module (Projector) remains underexplored in existing methods, which often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, which highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from the audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher into a compact 1.1B-parameter student that consistently outperforms the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
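The LDist term in the abstract, matching teacher and student output distributions via KL divergence, might look like the following sketch. The temperature `T`, the teacher-to-student KL direction, and the mean reduction are assumptions common in logits distillation, not details confirmed by the abstract; the paper applies such a term over both audio- and text-modality logits.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis, numerically stable."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_logits_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between softened class distributions.

    The T*T factor keeps gradient magnitudes comparable across
    temperatures (standard practice in logits distillation).
    """
    p = softmax(teacher_logits, T)  # teacher distribution (target)
    q = softmax(student_logits, T)  # student distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)
```

The loss is zero when the two logit sets induce identical distributions and strictly positive otherwise, so minimizing it pulls the student's soft predictions toward the teacher's.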
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Large Audio-Language Models
Speech Emotion Recognition
Cross-modal Projection
Feature Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Large Audio-Language Models
Centered Kernel Alignment
Cross-modal Projection
Speech Emotion Recognition
Qingran Yang
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Botao Zhao
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Zuheng Kang
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xue Li
Harbin Institute of Technology, Harbin, China
Yayun He
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Chuhang Liu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xulong Zhang
Ping An Technology (Shenzhen) Co., Ltd.
Federated Large Models, Trusted Computing, Graph Computing
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Junqing Peng
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher, Department of Electrical and Computer Engineering, University of Florida
Big Data, Storage System, Cloud Computing