A Cocktail-Party Benchmark: Multi-Modal Dataset and Comparative Evaluation Results

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the cocktail-party problem—multi-talker overlapping speech in a single room—by proposing Multi-Modal Context-Aware Recognition (MCoRec), a unified task for fine-grained understanding of “who spoke what, when, and with whom” that jointly performs speaker identification, automatic speech recognition, and conversation clustering. Methodologically, it integrates audio signals, visual lip movements, and contextual semantic cues in an end-to-end multimodal deep-learning framework. Key contributions include: (i) a large-scale, naturally collected multimodal dataset of unscripted multi-party overlapping conversations, designed specifically for modeling highly fragmented and strongly overlapping turns; and (ii) empirical evidence that an audio-only baseline exceeds 100% word error rate, while adding the visual modality improves it by roughly 50%, underscoring the critical role of multimodal synergy in parsing complex conversations.

📝 Abstract
We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations: the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering speakers into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields a substantial 50% improvement, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec task.
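As context for why the audio-only baseline can exceed 100% word error rate: WER counts substitutions, deletions, and insertions against the reference word count, so a system that transcribes an interfering talker's words alongside the target's accumulates insertions and can score above 100%. A minimal, self-contained sketch (illustrative only — not the challenge's official scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A single-stream transcript that picks up an overlapping talker's words
# inserts 5 extra words against a 4-word reference, giving WER > 100%:
print(wer("hello how are you",
          "hello so um how anyway are great you thanks"))  # 1.25
```

This is why visual cues (lip movements) help so much in MCoRec: attending to one speaker's face suppresses insertions from the overlapping conversation.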
Problem

Research questions and friction points this paper is trying to address.

- Solving the cocktail-party problem with multi-modal cues
- Automatically identifying overlapping speakers and their conversations
- Transcribing and clustering speech from audio-visual recordings

Innovation

Methods, ideas, or system contributions that make the work stand out.

- Multi-modal, audio-visual, context-aware recognition task and baselines
- Joint speech transcription and conversation clustering approach
- Integration of visual cues to disambiguate overlapping speech