🤖 AI Summary
To address the strong sensor dependency, high computational overhead, and audio-visual synchronization challenges of deploying multimodal speech systems, this paper proposes MUTUD, a framework that trains jointly on audio-visual modalities but performs inference with only a single modality. Its core innovation is the TAME (Temporally Aligned Modality feature Estimation) module, which estimates features of the missing modality via temporal alignment with the modalities present at inference, allowing single-modality inference to closely approximate full multimodal performance. MUTUD integrates multimodal representation learning, modality distillation, and lightweight network design. Evaluated across multiple audiovisual speech tasks, it substantially narrows the performance gap between unimodal and multimodal inference while reducing model size and computational cost, in some cases by almost 80%. To the authors' knowledge, MUTUD is the first framework to realize an efficient "multimodal training, unimodal deployment" paradigm.
📝 Abstract
Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct use of multimodal solutions in real-world applications. In this work, we develop approaches where learning happens with all available modalities but deployment or inference is done with just one or a reduced set of modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from the missing modality using the modalities present during inference. This approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can reduce the performance gap between the multimodal and corresponding unimodal models to a considerable extent. MUTUD achieves this while reducing the model size and compute compared to multimodal models, in some cases by almost 80%.
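The core idea, estimating the missing modality's features from the one available at inference, can be illustrated with a deliberately simplified sketch. The paper's TAME module is a learned neural estimator with temporal alignment; here, as a stand-in assumption, a plain least-squares map from audio features to visual features plays that role, and the variable names (`audio`, `visual`, `W_est`, `fused`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired training data: per-frame audio and visual feature vectors,
# where the visual stream is (by construction) predictable from the audio.
T, d_audio, d_visual = 200, 8, 4
W_true = rng.normal(size=(d_audio, d_visual))
audio = rng.normal(size=(T, d_audio))
visual = audio @ W_true + 0.05 * rng.normal(size=(T, d_visual))

# "Multimodal training": fit an estimator audio -> visual.
# (A linear least-squares fit stands in for the learned TAME module.)
W_est, *_ = np.linalg.lstsq(audio, visual, rcond=None)

# "Unimodal deployment": only audio arrives; synthesize the missing
# visual features, then fuse both for the downstream task head.
audio_test = rng.normal(size=(10, d_audio))
visual_hat = audio_test @ W_est
fused = np.concatenate([audio_test, visual_hat], axis=1)

print(fused.shape)  # fused representation despite audio-only input
```

The fused representation has the same dimensionality a true audio-visual model would see, which is what lets a single-modality deployment approximate multimodal inference; in MUTUD the estimator is trained end-to-end with temporal alignment rather than fit in closed form.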