An Efficient Transfer Learning Method Based on Adapter with Local Attributes for Speech Emotion Recognition

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech emotion recognition (SER) suffers from scarcity of high-quality labeled data and expensive retraining for cross-scenario adaptation. Method: We propose a task-agnostic, lightweight transfer learning framework built upon Wav2Vec 2.0. It features a teacher-student dual-branch adapter architecture, unsupervised clustering–derived local attribute representations, self-distillation, and masked prediction for generic knowledge transfer. Frame-level representation and utterance-level aggregation are jointly enhanced via exponential moving average, statistical attention pooling, and multi-objective optimization. Contribution/Results: Our method achieves state-of-the-art performance on IEMOCAP with significantly reduced labeling requirements—outperforming existing approaches under low-data regimes while improving cross-scenario transfer efficiency and substantially lowering dependency on large-scale annotated corpora.

📝 Abstract
Existing speech emotion recognition (SER) methods commonly suffer from the lack of high-quality, large-scale corpora, partly due to the complex, psychological nature of emotion, which makes accurate labeling difficult and time-consuming. Recently, transfer learning methods that exploit encoders pretrained on large-scale speech corpora (e.g., Wav2Vec2.0 and HuBERT) have shown strong potential for downstream SER tasks. However, task-specific fine-tuning remains necessary to achieve satisfactory performance across conversational scenarios with different topics, speakers, and languages, and it generally requires costly encoder retraining for each individual SER task. To address this issue, we propose to train an adapter with local attributes for efficient transfer learning. Specifically, a weighted average pooling-Transformer (WAP-Transformer) is proposed as a lightweight backbone to enrich the frame-level representation. An adapter with teacher-student branches is exploited for task-agnostic transfer learning, where the student branch is jointly optimized via masked-prediction and self-distillation objectives, and the teacher branch is obtained online from the student via exponential moving average (EMA). Meanwhile, local attributes are learned from the teacher branch via unsupervised clustering, acting as a universal model that provides additional semantically rich supervision. A statistical attentive pooling (SAP) module is proposed to obtain utterance-level representations for fine-tuning. To evaluate the effectiveness of the proposed adapter with local attributes, extensive experiments were conducted on IEMOCAP, where the method achieves superior performance compared to previous state-of-the-art approaches in similar settings.
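The abstract's teacher branch is maintained as an exponential moving average of the student's parameters. A minimal sketch of that update rule, assuming a plain per-parameter EMA (the function name `ema_update` and the decay value are illustrative, not from the paper):

```python
# Hedged sketch of the EMA teacher update described in the abstract:
# each teacher parameter drifts toward its student counterpart.
# `decay` close to 1.0 makes the teacher a slowly moving average.

def ema_update(teacher_params, student_params, decay=0.999):
    """Return updated teacher parameters: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, decay=0.9)
# teacher ≈ [0.1, 0.9]
```

Because the teacher is derived online from the student rather than trained separately, it adds no extra gradient computation, which is consistent with the paper's emphasis on lightweight, efficient transfer.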
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in speech emotion recognition
Reduces costly retraining for task-specific SER scenarios
Enables efficient transfer learning using adapter modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapter with teacher-student branches for transfer learning
Weighted average pooling-Transformer as lightweight backbone
Unsupervised clustering learns local attributes for supervision
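The last bullet describes deriving "local attribute" supervision by unsupervised clustering of teacher-branch features. A minimal sketch of that idea, assuming a tiny k-means over frame-level feature vectors; the function name, `k`, and the deterministic initialization are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def cluster_pseudo_labels(features, k=4, iters=10):
    """Tiny k-means: assign one cluster id (pseudo-label) per frame feature.

    features: (num_frames, dim) array of frame-level representations.
    Returns an array of num_frames cluster ids in [0, k).
    """
    # Deterministic init: pick k evenly spaced frames as starting centroids.
    centroids = features[np.linspace(0, len(features) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Toy example: two well-separated groups of frame features.
frames = np.vstack([np.zeros((5, 8)), np.ones((5, 8))])
labels = cluster_pseudo_labels(frames, k=2)
```

The resulting cluster ids can then serve as frame-level targets for the student's masked-prediction objective, giving semantically rich supervision without any manual emotion labels.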
Haoyu Song
Singapore Institute of Technology, Singapore

Ian McLoughlin
Professor, Singapore Institute of Technology (Singapore) and USTC (China)
AI for speech & audio, signal processing, embedded systems, computer architecture

Qing Gu
Nanjing University

Nan Jiang
School of Information Science and Technology, University of Science and Technology of China, Hefei, China

Yan Song
School of Information Science and Technology, University of Science and Technology of China, Hefei, China