🤖 AI Summary
In conventional fine-tuning of self-supervised speech models, fixed-layer aggregation (such as using only the top layer or a static weighted summation) introduces an information bottleneck and limits cross-sample generalization. To address this, we propose VARAN, a novel framework featuring input-adaptive dynamic layer aggregation. VARAN employs layer-specific probe heads and data-dependent weights to dynamically allocate contributions from different transformer layers per sample, thereby preserving layer-specific characteristics while enhancing representational flexibility. The aggregation process is optimized via variational inference, and VARAN integrates LoRA for parameter-efficient fine-tuning. Evaluated on automatic speech recognition and speech emotion recognition tasks, VARAN consistently outperforms strong baselines, and its combination with LoRA yields particularly substantial gains. These results demonstrate VARAN's superior downstream adaptability and robust generalization across diverse speech understanding tasks.
📝 Abstract
Conventional methods for aggregating layers in fine-tuned self-supervised speech models, such as using the final layer or a weighted sum, suffer from information bottlenecks and static feature weights shared across all dataset examples. We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. By employing layer-specialized probing heads and data-dependent weighting, VARAN adaptively prioritizes each layer's features based on the input. Evaluations on automatic speech recognition and speech emotion recognition tasks demonstrate VARAN's superior performance, particularly when combined with the LoRA fine-tuning technique. The framework resolves the trade-off between preserving layer-specific information and enabling flexible feature utilization, advancing efficient adaptation of self-supervised speech representations.
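To make the contrast with static weighted summation concrete, the following is a minimal NumPy sketch of per-input dynamic layer aggregation. It is not the paper's exact formulation (VARAN additionally uses layer-specific probe heads and variational inference); the gating projection `W_gate`, the mean-pooling summary, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_layer_aggregation(layer_feats, W_gate):
    """Input-adaptive weighted sum over transformer layers (illustrative).

    layer_feats: (L, T, D) hidden states from L layers, T frames, dim D.
    W_gate:      (D, L) hypothetical gating projection producing
                 data-dependent logits, one per layer.
    Returns the aggregated (T, D) features and the (L,) layer weights.
    """
    # Summarize the utterance: mean over time, then over layers -> (D,)
    summary = layer_feats.mean(axis=(0, 1))
    # Data-dependent logits and per-sample layer weights (sum to 1)
    weights = softmax(summary @ W_gate)
    # Weighted sum over the layer axis -> (T, D)
    aggregated = np.tensordot(weights, layer_feats, axes=(0, 0))
    return aggregated, weights

# Toy usage: two different inputs yield different layer weights,
# unlike a single static weight vector shared by the whole dataset.
rng = np.random.default_rng(0)
L, T, D = 4, 10, 8
W_gate = rng.normal(size=(D, L))
feats_a = rng.normal(size=(L, T, D))
feats_b = rng.normal(size=(L, T, D))
out_a, w_a = dynamic_layer_aggregation(feats_a, W_gate)
out_b, w_b = dynamic_layer_aggregation(feats_b, W_gate)
```

Here the gate conditions on a pooled summary of the input itself, so each utterance receives its own mixture over layers; a static weighted-sum baseline corresponds to replacing `weights` with a single learned constant vector.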