Lightweight Operations for Visual Speech Recognition

📅 2025-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high computational cost and deployment challenges of visual speech recognition (VSR) on resource-constrained devices, which stem from the high dimensionality of video inputs, this paper proposes a lightweight operator design paradigm tailored for VSR. The method integrates efficient spatiotemporal convolutions, an attention compression module, and knowledge distillation combined with structured pruning to jointly reduce parameter count, FLOPs, and inference latency. Evaluated end-to-end on the LRW benchmark, the resulting model achieves a 5× reduction in parameter count relative to state-of-the-art (SOTA) models and a 3.8× inference speedup while maintaining a competitive accuracy of 94.2%, only 0.6 percentage points below the SOTA. To the best of our knowledge, this is the first VSR framework that enables real-time, edge-deployable operation without a substantial loss in accuracy.

📝 Abstract

Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.
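To make the resource argument concrete, the sketch below counts the weights of one spatiotemporal layer built two ways: as a standard 3D convolution and as a depthwise 3D convolution followed by a 1×1×1 pointwise convolution. The depthwise-separable factorization is used here only as a common example of a lightweight operation; the abstract does not say which operations the paper uses, and the layer shape (64 input channels, 128 output channels, 3×3×3 kernel) is an illustrative assumption, not a value from the paper.

```python
# Illustrative weight count for one spatiotemporal layer.
# Assumptions (not from the paper): the layer shape below, and the use of a
# depthwise-separable factorization as the "lightweight" variant.


def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a standard 3D convolution (bias omitted)."""
    return c_in * c_out * k_t * k_h * k_w


def separable_conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a depthwise 3D conv plus a 1x1x1 pointwise conv (bias omitted)."""
    depthwise = c_in * k_t * k_h * k_w  # one k_t x k_h x k_w filter per input channel
    pointwise = c_in * c_out            # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 128 channels with a 3x3x3 kernel.
standard = conv3d_params(64, 128, 3, 3, 3)
separable = separable_conv3d_params(64, 128, 3, 3, 3)
ratio = standard / separable

print(f"standard: {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {ratio:.1f}x")
```

For a fixed output size and stride, multiply-accumulate operations scale with the same weight counts, so the ratio for this one layer carries over to its compute cost. Whole-model savings are typically smaller, since not every layer can be factorized this way.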

Problem

Research questions and friction points this paper is trying to address.

Develop lightweight VSR architectures

Reduce computational costs for VSR

Enable VSR on resource-constrained devices

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight VSR architectures developed

Efficient operation design paradigms utilized

Models trained on large-scale dataset

🔎 Similar Papers

DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module