🤖 AI Summary
To balance scalability and robustness in audio-visual speech recognition (AVSR) under noisy conditions, this paper proposes MoHAVE (Mixture of Hierarchical Audio-Visual Experts), an input-adaptive hierarchical Mixture-of-Experts framework. Methodologically, it introduces a hierarchical sparse MoE architecture that couples context-aware dynamic gating with modality-specific expert activation, enabling fine-grained cross-modal alignment and joint representation learning of audio-visual features. Unlike conventional dense models, MoHAVE scales model capacity and improves noise robustness without incurring redundant computation. Empirically, it achieves state-of-the-art performance on the LRS3 and MuAViC benchmarks, delivering substantial accuracy gains while keeping inference overhead controlled. This work establishes a practical path toward large-scale robust AVSR systems.
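The paper does not include code here, but the hierarchical gating idea can be made concrete. Below is a minimal PyTorch-style sketch of two-level routing: a group-level gate weights modality-specific expert groups (e.g., audio, visual, audio-visual), and a per-group router then activates the top-k experts within each group. All names and hyperparameters (`HierarchicalMoE`, `ExpertFFN`, `num_groups`, `experts_per_group`, `top_k`) are illustrative assumptions, not taken from the paper; for clarity the sketch runs every expert densely and masks outputs, whereas a real sparse implementation would dispatch only the routed tokens to each expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A single feed-forward expert (a standard Transformer FFN block)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class HierarchicalMoE(nn.Module):
    """Hypothetical two-level routing: a context-aware group gate weights
    modality-specific expert groups, then a per-group router selects the
    top-k experts inside each group."""
    def __init__(self, d_model=512, d_hidden=2048,
                 num_groups=3, experts_per_group=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Level 1: soft weighting over expert groups, driven by input context.
        self.group_gate = nn.Linear(d_model, num_groups)
        # Level 2: one token-level router per group.
        self.routers = nn.ModuleList(
            nn.Linear(d_model, experts_per_group) for _ in range(num_groups)
        )
        self.groups = nn.ModuleList(
            nn.ModuleList(ExpertFFN(d_model, d_hidden)
                          for _ in range(experts_per_group))
            for _ in range(num_groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) fused audio-visual features.
        group_w = F.softmax(self.group_gate(x), dim=-1)      # (B, T, G)
        out = torch.zeros_like(x)
        for g, (router, experts) in enumerate(zip(self.routers, self.groups)):
            logits = router(x)                                # (B, T, E)
            topk_w, topk_idx = logits.topk(self.top_k, dim=-1)
            topk_w = F.softmax(topk_w, dim=-1)                # normalize over top-k
            group_out = torch.zeros_like(x)
            for k in range(self.top_k):
                idx = topk_idx[..., k]                        # (B, T) expert ids
                w = topk_w[..., k].unsqueeze(-1)              # (B, T, 1)
                for e, expert in enumerate(experts):
                    mask = (idx == e).unsqueeze(-1)           # tokens routed to e
                    if mask.any():
                        group_out = group_out + mask * w * expert(x)
            out = out + group_w[..., g:g+1] * group_out
        return out
```

Under this sketch, sparsity comes from the top-k selection inside each group, while the group gate provides the input-adaptive, modality-aware behavior the summary describes: noisy audio can shift weight toward the visual expert group without recomputing the whole model densely.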
📝 Abstract
Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
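To make the sparse-scaling contribution tangible, here is a hypothetical usage of the `HierarchicalMoE` sketch above on a dummy batch of fused audio-visual frame features; shapes and sizes are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical usage: per-token hierarchical routing over a dummy batch.
layer = HierarchicalMoE(d_model=512, d_hidden=2048,
                        num_groups=3, experts_per_group=4, top_k=2)
features = torch.randn(8, 100, 512)   # (batch, frames, d_model)
output = layer(features)
print(output.shape)                    # torch.Size([8, 100, 512])
```

Because only `top_k` experts per group are active for each token, total parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the scalability property the abstract emphasizes.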