🤖 AI Summary
In conventional fine-tuning of self-supervised speech models, fixed-layer aggregation (such as using only the top layer or a static weighted summation) introduces an information bottleneck and limits cross-sample generalization. To address this, we propose VARAN, a novel framework featuring input-adaptive dynamic layer aggregation. VARAN employs layer-specific probe heads and data-dependent weights to dynamically allocate contributions from different transformer layers per sample, thereby preserving layer-specific characteristics while enhancing representational flexibility. The aggregation process is optimized via variational inference, and VARAN integrates LoRA for parameter-efficient fine-tuning. Evaluated on automatic speech recognition and speech emotion recognition tasks, VARAN consistently outperforms strong baselines, and its combination with LoRA yields particularly substantial gains. These results demonstrate VARAN's superior downstream adaptability and robust generalization across diverse speech understanding tasks.
📝 Abstract
Conventional methods for aggregating layers in fine-tuned self-supervised speech models, such as using the final layer or a weighted sum, suffer from information bottlenecks and static feature weights shared across all dataset examples. We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. By employing layer-specialized probing heads and data-dependent weighting, VARAN adaptively prioritizes each layer's features based on the input. Evaluations on automatic speech recognition and speech emotion recognition tasks demonstrate VARAN's superior performance, particularly when combined with the LoRA fine-tuning technique. The framework resolves the trade-off between preserving layer-specific information and enabling flexible feature utilization, advancing efficient adaptation of self-supervised speech representations.
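To make the contrast with static weighted summation concrete, the following is a minimal NumPy sketch of per-input dynamic layer aggregation. It is not the paper's exact formulation (VARAN additionally uses layer-specific probe heads and variational inference); the gating projection `W_gate`, the mean-pooling summary, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_layer_aggregation(layer_feats, W_gate):
    """Input-adaptive weighted sum over transformer layers (illustrative).

    layer_feats: (L, T, D) hidden states from L layers, T frames, dim D.
    W_gate:      (D, L) hypothetical gating projection producing
                 data-dependent logits, one per layer.
    Returns the aggregated (T, D) features and the (L,) layer weights.
    """
    # Summarize the utterance: mean over time, then over layers -> (D,)
    summary = layer_feats.mean(axis=(0, 1))
    # Data-dependent logits and per-sample layer weights (sum to 1)
    weights = softmax(summary @ W_gate)
    # Weighted sum over the layer axis -> (T, D)
    aggregated = np.tensordot(weights, layer_feats, axes=(0, 0))
    return aggregated, weights

# Toy usage: two different inputs yield different layer weights,
# unlike a single static weight vector shared by the whole dataset.
rng = np.random.default_rng(0)
L, T, D = 4, 10, 8
W_gate = rng.normal(size=(D, L))
feats_a = rng.normal(size=(L, T, D))
feats_b = rng.normal(size=(L, T, D))
out_a, w_a = dynamic_layer_aggregation(feats_a, W_gate)
out_b, w_b = dynamic_layer_aggregation(feats_b, W_gate)
```

Here the gate conditions on a pooled summary of the input itself, so each utterance receives its own mixture over layers; a static weighted-sum baseline corresponds to replacing `weights` with a single learned constant vector.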