LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

📅 2026-04-30

🤖 AI Summary

Current audio-visual speech recognition (AVSR) benchmarks are built predominantly from data recorded in controlled environments, so they fail to capture the complexity of real-world scenarios. This work proposes LRS-VoxMM, the first AVSR benchmark constructed from large-scale, naturally occurring conversational data. Samples are filtered and aligned from the VoxMM dataset and released in the LRS format for compatibility with established pipelines. To improve ecological validity, the benchmark also includes robustness evaluation subsets with acoustic distortions: additive noise, reverberation, and bandwidth limitation. Experiments show that LRS-VoxMM is substantially more challenging than LRS3 and that the visual modality contributes more as the audio degrades. The benchmark therefore offers a more realistic and demanding platform for advancing AVSR research in practical settings.

📝 Abstract

We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual information in challenging real-world conditions.
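
The three distortion types named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function names, the SNR-based noise scaling, the truncated convolution for reverberation, and the ideal FFT low-pass filter are all assumptions chosen for illustration, and the abstract does not state the actual noise sources, impulse responses, or cutoff frequencies used.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio (hypothetical helper)."""
    # Loop or trim the noise so it matches the speech length.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that speech_power / (scale**2 * noise_power) == 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    # Normalise by the RIR peak (an assumed convention, not taken from the paper).
    rir = rir / np.max(np.abs(rir))
    return np.convolve(speech, rir)[: len(speech)]


def limit_bandwidth(speech, sample_rate, cutoff_hz):
    """Remove all content above cutoff_hz with an ideal FFT-domain low-pass filter."""
    spectrum = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(speech))
```

Under these assumptions, a call such as `limit_bandwidth(add_noise_at_snr(clean, babble, 0.0), 16000, 4000.0)` would produce a noisy, band-limited version of a 16 kHz waveform; `clean` and `babble` are placeholder arrays, not files from the benchmark.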

Problem

Research questions and friction points this paper is trying to address.

audio-visual speech recognition

in-the-wild

benchmark

acoustic degradation

real-world conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual speech recognition

in-the-wild benchmark

acoustic degradation