Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing parameter-efficient fine-tuning (PEFT) methods, including recent Bayesian low-rank adapters, do not model uncertainty driven by audio context such as background noise, channel variability, or speaking style, which limits their reliability in speech-centered applications. This work proposes CoCo-LoRA, a multimodal, uncertainty-aware PEFT method for text prediction tasks accompanied by audio, which treats audio as a contextual uncertainty signal rather than as a fused feature stream. It conditions a variational posterior in the low-rank adapter space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then adapted by lightweight layer-wise heads, giving global-to-local, depth-specific modulation of the adapter update and its heteroscedastic uncertainty. Across the evaluated tasks and backbone combinations, CoCo-LoRA matches or outperforms text-only PEFT and conventional feature-fusion baselines, particularly on high-coverage labels where reliable adaptation is critical.

📝 Abstract

We introduce CoCo-LoRA, a multimodal, uncertainty-aware parameter-efficient fine-tuning method for text prediction tasks accompanied by audio context. Existing PEFT approaches such as LoRA are efficient but typically deterministic, while recent Bayesian low-rank adapters model uncertainty in a lightweight way yet remain largely unimodal and condition uncertainty primarily on internal text features. This leaves them poorly equipped to reflect uncertainty driven by external acoustic factors such as background noise, channel variability, or speaking style, which can materially affect reliability in speech-centered applications. CoCo-LoRA addresses this gap by conditioning a contextual variational posterior in the low-rank space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then adapted through lightweight layer-wise heads, enabling global-to-local, depth-specific modulation of the adapter uncertainty and update without high-dimensional multimodal fusion. Stochasticity is confined to a compact latent component in the rank space, preserving PEFT scalability while producing audio-sensitive, heteroscedastic uncertainty. Based on our evaluations across diverse tasks and backbone combinations, CoCo-LoRA consistently matches or outperforms text-only PEFT and conventional feature-fusion transfer baselines, particularly on high-coverage labels where reliable adaptation is critical. The results indicate that using audio as a contextual uncertainty signal, rather than as a fused feature stream, provides a robust and parameter-efficient alternative for multimodal low-resource prediction.
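The data flow described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reading under stated assumptions, not the authors' implementation: the dimensions, the `tanh` nonlinearity in the layer-wise head, the linear maps `W_mu` and `W_logvar`, the additive combination `h + z`, and the function name `adapter_update` are all hypothetical choices made for the example. The frozen backbone, the audio encoder, and the training objective are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
d_model, rank = 16, 4        # hidden width and low-rank dimension
d_audio, d_ctx = 12, 8       # pooled audio embedding size, shared context size
n_layers = 3

# Shared projection, applied once per example to the pooled audio embedding.
W_shared = rng.normal(scale=0.1, size=(d_audio, d_ctx))

def init_layer():
    """Parameters of one hypothetical audio-conditioned Bayesian adapter."""
    return {
        "A": rng.normal(scale=0.1, size=(d_model, rank)),     # low-rank down-projection
        # Random init for the demo so outputs are nonzero (LoRA usually starts B at zero).
        "B": rng.normal(scale=0.1, size=(rank, d_model)),     # low-rank up-projection
        "head": rng.normal(scale=0.1, size=(d_ctx, d_ctx)),   # lightweight layer-wise head
        "W_mu": rng.normal(scale=0.1, size=(rank + d_ctx, rank)),
        "W_logvar": rng.normal(scale=0.1, size=(rank + d_ctx, rank)),
    }

layers = [init_layer() for _ in range(n_layers)]

def adapter_update(layer, x, ctx, sample=True):
    """Return (delta, logvar) for inputs x (batch, d_model) and ctx (batch, d_ctx)."""
    h = x @ layer["A"]                      # local text-derived features in rank space
    c = np.tanh(ctx @ layer["head"])        # depth-specific view of the shared context
    hc = np.concatenate([h, c], axis=-1)    # posterior is conditioned on both signals
    mu = hc @ layer["W_mu"]
    logvar = hc @ layer["W_logvar"]         # input-dependent (heteroscedastic) variance
    eps = rng.standard_normal(mu.shape) if sample else np.zeros_like(mu)
    z = mu + np.exp(0.5 * logvar) * eps     # reparameterised latent, rank-sized only
    delta = (h + z) @ layer["B"]            # assumed additive combination in rank space
    return delta, logvar

x = rng.normal(size=(2, d_model))           # stand-in for text hidden states
audio = rng.normal(size=(2, d_audio))       # stand-in for pooled audio embeddings
ctx = audio @ W_shared                      # projected once, reused by every layer

h = x
for layer in layers:
    delta, logvar = adapter_update(layer, h, ctx)
    h = h + delta                           # identity stands in for the frozen base layer
```

In this sketch the audio never enters the hidden state as a concatenated feature; it only shapes the mean and variance of a rank-sized latent, so the added stochastic parameters scale with the rank rather than with the model width.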

Problem

Research questions and friction points this paper is trying to address.

multimodal uncertainty estimation

parameter-efficient fine-tuning

audio context

Bayesian adaptation

heteroscedastic uncertainty

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian PEFT

multimodal uncertainty

contextual modulation