🤖 AI Summary
This work addresses addressee recognition, the implicit-reference challenge of identifying the intended recipient of the next utterance, in multi-party, multi-modal dialogue. We introduce the first dedicated corpus and benchmark for tri-modal (speech, text, vision), triadic dialogue, with a formal task definition and systematic annotation: only about 20% of utterances contain an explicit addressee reference, underscoring the need for implicit reasoning. Methodologically, we combine multi-modal dialogue collection, structured dialogue-state modeling, and confusion analysis, and evaluate GPT-4o in zero-shot and few-shot settings. Experiments show that GPT-4o reaches only about 33.3% accuracy, barely above chance, revealing a critical deficiency in modeling dynamic participant roles and social grounding. The corpus and benchmark provide a foundation, and initial insights, for advancing multi-agent interaction and socially aware dialogue AI.
📝 Abstract
Handling multi-party dialogues represents a significant step in advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal, multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on addressee recognition, the task of identifying who is being addressed and is thus expected to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To gauge the task's difficulty, we benchmarked a large language model (GPT-4o) on addressee recognition. GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.
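The evaluation setup described above can be sketched as follows. This is a minimal, hypothetical illustration of zero-shot addressee recognition over a triadic dialogue: the prompt wording, participant labels, and candidate set are assumptions, not the paper's actual protocol (in particular, the effective chance level depends on how many addressee options a turn allows, e.g. 1/2 if only the two other participants are candidates, or 1/3 if a "both/all" option is included).

```python
# Hypothetical sketch of a zero-shot addressee-recognition benchmark.
# Participant labels and prompt wording are illustrative assumptions.
PARTICIPANTS = ["A", "B", "C"]  # triadic (three-participant) discussion

def build_prompt(history, speaker):
    """Format a zero-shot prompt asking an LLM who the speaker addresses next.

    `history` is a list of (speaker, utterance) pairs; the candidate
    addressees are the other two participants.
    """
    lines = [f"{s}: {u}" for s, u in history]
    candidates = [p for p in PARTICIPANTS if p != speaker]
    return (
        "Dialogue so far:\n"
        + "\n".join(lines)
        + f"\n\n{speaker} speaks next. Who is the addressee? "
        + f"Answer with exactly one of: {', '.join(candidates)}."
    )

def chance_baseline(n_candidates):
    """Accuracy of uniform random guessing over the candidate set."""
    return 1.0 / n_candidates

def accuracy(predictions, gold_labels):
    """Fraction of turns where the predicted addressee matches the annotation."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

In a real run, `build_prompt` output would be sent to the model (e.g. via an API call) and its answer compared against the annotated addressee with `accuracy`; comparing that score to `chance_baseline` is what supports the "marginally above chance" conclusion.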