Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual speech recognition (AVSR) under diverse and varying noise conditions remains challenging, particularly when visual input is unreliable or unavailable. Method: This paper proposes a lightweight, adaptive visual enhancement framework built upon the pre-trained Whisper model. It integrates LoRA adapters with a dedicated visual branch, forming a multi-agent adapter architecture: distinct adapter groups are trained for specific noise types and intensities and are dynamically selected by a noise-scene classifier; crucially, the system gracefully degrades to audio-only ASR when visual input is absent. Contribution/Results: The method fine-tunes only 0.7% of Whisper's parameters—reducing trainable parameters by 88.5% versus full fine-tuning—yet achieves near-state-of-the-art accuracy across most noisy scenarios. It delivers strong robustness to heterogeneous acoustic distortions, minimal computational overhead, and inherent scalability to new noise conditions and modalities.
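The core parameter-efficiency idea is the standard LoRA update: a frozen pre-trained weight matrix W is augmented with a trainable low-rank product BA, so only A and B are learned. The following minimal sketch (hypothetical code, not from the paper; `lora_forward` and the tiny matrices are illustrative) shows the computation:

```python
# Minimal LoRA sketch: y = W x + alpha * B(A x), where W stays frozen
# and only the low-rank factors A (r x d_in) and B (d_out x r) train.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """Forward pass of a linear layer with a LoRA adapter attached."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # low-rank trainable path
    return [b + alpha * u for b, u in zip(base, update)]

# Tiny example: 2x2 frozen weight (identity), rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # 1 x 2 down-projection (trainable)
B = [[0.5], [0.0]]      # 2 x 1 up-projection (trainable)
y = lora_forward(W, A, B, [2.0, 3.0])   # -> [4.5, 3.0]
```

With rank r much smaller than the layer dimensions, the adapter adds only r·(d_in + d_out) trainable parameters per layer, which is what makes training many noise-specific adapter sets cheap.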

📝 Abstract
We present an approach to Audio-Visual Speech Recognition that builds on a pre-trained Whisper model. To infuse visual information into this audio-only model, we extend it with an AV fusion module and LoRA adapters, one of the most up-to-date adapter approaches. One advantage of adapter-based approaches is that only a relatively small number of parameters are trained, while the base model remains unchanged. Common AVSR approaches train single models to handle several noise categories and noise levels simultaneously. Taking advantage of the lightweight nature of adapter approaches, we train noise-scenario-specific adapter sets, each covering an individual noise category or a specific noise-level range. The most suitable adapter set is selected by first classifying the noise scenario. This enables our models to achieve optimal coverage across different noise categories and noise levels while training only a minimal number of parameters. Compared to a full fine-tuning approach with SOTA performance, our models achieve almost comparable results over the majority of the tested noise categories and noise levels, with up to 88.5% fewer trainable parameters. Our approach can be extended with further noise-specific adapter sets to cover additional noise scenarios. It is also possible to utilize the underlying powerful ASR model when no visual information is available, as it remains unchanged.
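The selection mechanism described in the abstract can be sketched as a simple dispatch: a noise-scene classifier's output indexes into a table of adapter sets, and the unchanged audio-only backbone is used when no visual input is available. This is a hypothetical illustration of the control flow only; the category names, level bins, and `select_adapters` function are assumptions, not the paper's implementation:

```python
# Hypothetical adapter-set dispatch mirroring the paper's idea:
# pick the adapter set matching the classified noise scenario,
# fall back to audio-only ASR when video is missing.

ADAPTER_SETS = {  # (noise category, noise-level bin) -> adapter set; names illustrative
    ("babble", "high"): "adapters_babble_high",
    ("babble", "low"):  "adapters_babble_low",
    ("music",  "high"): "adapters_music_high",
}

def select_adapters(noise_category, noise_level, has_video):
    """Return the adapter set to load, or None for plain audio-only Whisper."""
    if not has_video:
        return None  # backbone is unchanged, so audio-only ASR works as-is
    return ADAPTER_SETS.get((noise_category, noise_level), "adapters_generic")

print(select_adapters("babble", "high", True))   # adapters_babble_high
print(select_adapters("music", "high", False))   # None -> audio-only ASR
```

Because each adapter set is small, new noise scenarios can be covered by adding table entries and training only the corresponding adapters, without touching the backbone or the existing sets.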
Problem

Research questions and friction points this paper is trying to address.

Enhancing ASR with visual data
Reducing parameters via adapters
Optimizing noise-specific speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapter-based AVSR extension
LoRA adapter integration
Noise-scenario-specific adapter-sets