🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods predominantly rely on a single visual modality, typically lip movements, and overlook the potential of modeling multiple visual cues jointly. This work proposes LiPS-AVSR, the first framework to incorporate presentation slides as structured visual cues alongside lip motion for Chinese AVSR. To support it, the authors release Chinese-LiPS, the first 100-hour Chinese audio-visual dataset jointly annotated with lip-motion and presentation-slide modalities. Methodologically, LiPS-AVSR comprises three core components: lip-feature extraction, slide text-visual representation alignment, and end-to-end audio-visual joint decoding. Experiments show that lip reading alone improves ASR performance by about 8%, slide information by about 25%, and their fusion yields roughly a 35% relative WER reduction over an audio-only baseline, substantially outperforming either modality on its own. This work broadens both the range and the semantic granularity of visual cues used for AVSR.
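A minimal sketch of the multimodal fusion idea, assuming a PyTorch-style encoder that projects audio frames, lip-region features, and slide-text embeddings into a shared space before joint decoding. The module names, dimensions, and fusion-by-concatenation choice are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of an audio + lip + slide fusion front-end (PyTorch).
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class MultimodalFusionASR(nn.Module):
    def __init__(self, audio_dim=80, lip_dim=512, slide_dim=768,
                 d_model=256, vocab_size=5000):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. log-mel filterbank frames
        self.lip_proj = nn.Linear(lip_dim, d_model)       # e.g. lip-ROI visual features
        self.slide_proj = nn.Linear(slide_dim, d_model)   # e.g. OCR-text embeddings from slides
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, vocab_size)  # per-frame token logits

    def forward(self, audio, lip, slide):
        # Concatenate the three modality streams along the time axis so that
        # self-attention can mix audio, lip, and slide context jointly.
        fused = torch.cat([
            self.audio_proj(audio),
            self.lip_proj(lip),
            self.slide_proj(slide),
        ], dim=1)
        return self.classifier(self.encoder(fused))

# Toy usage: batch of 2 utterances with 100 audio frames, 25 lip frames, 10 slide-text tokens.
model = MultimodalFusionASR()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512), torch.randn(2, 10, 768))
print(logits.shape)  # torch.Size([2, 135, 5000])
```

Concatenation-then-attention is only one of several plausible fusion strategies (cross-attention or late decision fusion are alternatives); the sketch simply illustrates how slide text can enter the decoder as an additional conditioning stream.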
📝 Abstract
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or contextual video of the speaker, neglecting the potential of combining these complementary visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcriptions, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/.
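To make the reported gains concrete, here is a small, purely illustrative calculation of relative WER reduction; the baseline WER below is an assumption, since the paper quotes relative improvements rather than these absolute numbers:

```python
# Illustrative arithmetic only: how a relative WER reduction is computed.
# The 20% baseline WER is made up; the paper reports relative gains, not these absolute values.
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative reduction of system_wer with respect to baseline_wer."""
    return (baseline_wer - system_wer) / baseline_wer

baseline = 0.20                   # hypothetical audio-only WER of 20%
combined = baseline * (1 - 0.35)  # ~35% relative gain from lip + slide fusion
print(f"{relative_wer_reduction(baseline, combined):.0%}")  # -> 35%
```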