🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods predominantly rely on a single visual modality, typically lip movements, and overlook the potential of modeling multiple visual cues jointly. This work proposes LiPS-AVSR, the first framework to incorporate presentation slides as structured visual cues alongside lip motion for Chinese AVSR. To support it, the authors release Chinese-LiPS, the first 100-hour Chinese audio-visual dataset jointly annotated with lip-motion and presentation-slide modalities. Methodologically, LiPS-AVSR comprises three core components: lip-feature extraction, slide text-visual representation alignment, and end-to-end audio-visual joint decoding. Experiments show that lip reading alone improves ASR performance by about 8%, slide information by about 25%, and their fusion yields roughly a 35% relative WER reduction over an audio-only baseline, substantially outperforming either modality on its own. This work broadens both the range and the semantic granularity of visual cues used for AVSR.
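A minimal sketch of the multimodal fusion idea, assuming a PyTorch-style encoder that projects audio frames, lip-region features, and slide-text embeddings into a shared space before joint decoding. The module names, dimensions, and fusion-by-concatenation choice are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of an audio + lip + slide fusion front-end (PyTorch).
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class MultimodalFusionASR(nn.Module):
    def __init__(self, audio_dim=80, lip_dim=512, slide_dim=768,
                 d_model=256, vocab_size=5000):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. log-mel filterbank frames
        self.lip_proj = nn.Linear(lip_dim, d_model)       # e.g. lip-ROI visual features
        self.slide_proj = nn.Linear(slide_dim, d_model)   # e.g. OCR-text embeddings from slides
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, vocab_size)  # per-frame token logits

    def forward(self, audio, lip, slide):
        # Concatenate the three modality streams along the time axis so that
        # self-attention can mix audio, lip, and slide context jointly.
        fused = torch.cat([
            self.audio_proj(audio),
            self.lip_proj(lip),
            self.slide_proj(slide),
        ], dim=1)
        return self.classifier(self.encoder(fused))

# Toy usage: batch of 2 utterances with 100 audio frames, 25 lip frames, 10 slide-text tokens.
model = MultimodalFusionASR()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512), torch.randn(2, 10, 768))
print(logits.shape)  # torch.Size([2, 135, 5000])
```

Concatenation-then-attention is only one of several plausible fusion strategies (cross-attention or late decision fusion are alternatives); the sketch simply illustrates how slide text can enter the decoder as an additional conditioning stream.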
📝 Abstract
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or contextual video of the speaker, neglecting the potential of combining these complementary visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcriptions, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/.
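To make the reported gains concrete, here is a small, purely illustrative calculation of relative WER reduction; the baseline WER below is an assumption, since the paper quotes relative improvements rather than these absolute numbers:

```python
# Illustrative arithmetic only: how a relative WER reduction is computed.
# The 20% baseline WER is made up; the paper reports relative gains, not these absolute values.
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative reduction of system_wer with respect to baseline_wer."""
    return (baseline_wer - system_wer) / baseline_wer

baseline = 0.20                   # hypothetical audio-only WER of 20%
combined = baseline * (1 - 0.35)  # ~35% relative gain from lip + slide fusion
print(f"{relative_wer_reduction(baseline, combined):.0%}")  # -> 35%
```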