DCIM-AVSR : Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module

📅 2024-08-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of large parameter count, high training cost, and deployment complexity in audio-visual speech recognition (AVSR) under noisy conditions, this paper proposes a lightweight audio-visual joint modeling framework. Methodologically, it introduces: (1) a novel Dual Conformer Interaction Module (DCIM) that explicitly models hierarchical coupling between lip movements and acoustic signals at the architectural level; (2) a selective parameter update pretraining strategy to balance transfer efficiency and downstream performance; and (3) a multimodal feature alignment mechanism enabling fine-grained audio-visual fusion. Experiments demonstrate that the proposed approach significantly reduces model parameters and computational overhead while achieving superior recognition accuracy over mainstream unimodal and coarse-grained fusion baselines in noisy environments. These improvements substantially enhance the practicality and deployability of AVSR systems.
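The "selective parameter update" pretraining strategy mentioned above can be illustrated with a minimal sketch: during a training step, only a chosen subset of parameters receives gradient updates while the rest stay frozen. The parameter names, selection set, and plain SGD rule here are illustrative assumptions, not the paper's actual criterion or optimizer.

```python
# Hypothetical sketch of selective parameter updating: only parameters
# whose names are in `trainable` are updated; the rest are kept frozen.
# The selection rule and parameter names below are illustrative, not the
# paper's actual configuration.
def sgd_step(params, grads, trainable, lr=0.01):
    """Apply a plain SGD update only to parameters flagged as trainable."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

params = {"audio_encoder.w": 1.0, "visual_encoder.w": 2.0, "fusion.w": 3.0}
grads  = {"audio_encoder.w": 0.5, "visual_encoder.w": 0.5, "fusion.w": 0.5}

# e.g. freeze the pretrained audio encoder; adapt visual encoder and fusion
trainable = {"visual_encoder.w", "fusion.w"}
params = sgd_step(params, grads, trainable)
print(params)  # audio_encoder.w is unchanged at 1.0
```

In practice the same idea is usually realized by toggling gradient tracking per parameter group in the training framework; the dictionary form above just makes the frozen/updated split explicit.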

📝 Abstract
Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. This technology is essential for applications such as virtual assistants, transcription services, and communication tools. The Audio-Visual Speech Recognition (AVSR) model enhances traditional speech recognition, particularly in noisy environments, by incorporating visual modalities like lip movements and facial expressions. While traditional AVSR models trained on large-scale datasets with numerous parameters can achieve remarkable accuracy, often surpassing human performance, they also come with high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the number of parameters through the integration of a Dual Conformer Interaction Module (DCIM). In addition, we propose a pre-training method that further optimizes model performance by selectively updating parameters, leading to significant improvements in efficiency. Unlike conventional models that require the system to independently learn the hierarchical relationship between audio and visual modalities, our approach incorporates this distinction directly into the model architecture. This design enhances both efficiency and performance, resulting in a more practical and effective solution for AVSR tasks.
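The fine-grained audio-visual fusion described in the abstract can be sketched as cross-modal attention: each audio frame attends over the (lower-frame-rate) visual sequence and mixes in the matching visual features. This NumPy sketch is a generic illustration of that idea under assumed shapes and a residual-fusion design; it is not the paper's exact DCIM architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, visual):
    """Audio frames attend to visual frames (illustrative fusion sketch;
    not the paper's exact Dual Conformer Interaction Module)."""
    d_k = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d_k)  # (T_audio, T_visual) similarities
    weights = softmax(scores, axis=-1)        # each audio frame -> visual mixture
    return audio + weights @ visual           # residual fusion of the two streams

rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 64))   # 100 audio frames, feature dim 64
visual = rng.standard_normal((25, 64))   # 25 video frames (lower frame rate)
fused = cross_modal_attention(audio, visual)
print(fused.shape)  # (100, 64)
```

Note that no explicit temporal alignment is needed: the attention weights handle the audio/video frame-rate mismatch by letting every audio frame softly select the relevant visual frames.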
Problem

Research questions and friction points this paper is trying to address.

Efficient AVSR
Noisy Environment
Reduced Training Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

DCIM-AVSR
Dual Fusion Module
Efficient Real-time Recognition
Xinyu Wang
School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China
Qian Wang
School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China