VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In conventional fine-tuning of self-supervised speech models, fixed-layer aggregation—such as using only the top layer or static weighted summation—introduces an information bottleneck and limits cross-sample generalization. To address this, we propose VARAN, a novel framework featuring input-adaptive dynamic layer aggregation. VARAN employs layer-specific probe heads and data-dependent weights to dynamically allocate contributions from different transformer layers per sample, thereby preserving layer-specific characteristics while enhancing representational flexibility. The aggregation process is optimized via variational inference, and VARAN integrates LoRA for parameter-efficient fine-tuning. Evaluated on automatic speech recognition and speech emotion recognition tasks, VARAN consistently outperforms strong baselines; its combination with LoRA yields particularly substantial gains. These results demonstrate VARAN’s superior downstream adaptability and robust generalization capability across diverse speech understanding tasks.
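The contrast between static and input-adaptive layer aggregation can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the mean-pooling summary, the linear gate `W_gate`, and all shapes are hypothetical stand-ins, and the variational-inference objective, probe heads, and LoRA adapters are omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, T, D = 12, 50, 768  # hypothetical: 12 transformer layers, 50 frames, 768-dim features

# Per-layer hidden states for one utterance: (num_layers, T, D)
H = rng.standard_normal((num_layers, T, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def static_weighted_sum(H, w):
    """Conventional aggregation: one learned weight per layer, shared by every input."""
    a = softmax(w)                       # fixed distribution over layers
    return np.einsum("l,ltd->td", a, H)  # (T, D)

def dynamic_weighted_sum(H, W_gate):
    """Input-adaptive aggregation: layer weights are computed from the input itself.
    Here a hypothetical linear gate maps a mean-pooled utterance summary to
    per-layer logits, so different utterances receive different layer mixtures."""
    pooled = H.mean(axis=(0, 1))         # (D,) crude utterance summary
    a = softmax(W_gate @ pooled)         # data-dependent distribution over layers
    return np.einsum("l,ltd->td", a, H)  # (T, D)

w_static = rng.standard_normal(num_layers)
W_gate = 0.01 * rng.standard_normal((num_layers, D))

out_static = static_weighted_sum(H, w_static)
out_dynamic = dynamic_weighted_sum(H, W_gate)
```

In the static case every utterance in the dataset shares one layer mixture; in the dynamic case the mixture is recomputed per sample, which is the property the summary credits with preserving layer-specific characteristics.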

📝 Abstract
Conventional methods for aggregating layers in fine-tuned self-supervised speech models, such as using the final layer or a weighted sum, suffer from information bottlenecks and static feature weighting applied uniformly to all dataset examples. We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. By employing layer-specialized probing heads and data-dependent weighting, VARAN adaptively prioritizes each layer's features based on the input. Evaluations on automatic speech recognition and speech emotion recognition tasks demonstrate VARAN's superior performance, particularly when using the LoRA fine-tuning technique. The framework resolves the trade-off between preserving layer-specific information and enabling flexible feature utilization, advancing efficient adaptation of self-supervised speech representations.
Problem

Research questions and friction points this paper is trying to address.

Dynamic layer aggregation for speech model fine-tuning
Overcoming static feature weighting in self-supervised models
Adaptive prioritization of layer features by input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer aggregation tailored to inputs
Layer-specialized probing heads for adaptation
Data-dependent weighting for feature prioritization