Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations

📅 2025-05-19
🤖 AI Summary
Accurate turn-taking prediction in noisy three-party group conversations remains challenging, particularly for real-time hearing assistance via smart glasses. Method: This paper proposes a lightweight multimodal turn-prediction framework tailored to this scenario. It models first-person gaze as an active, spatially constrained predictive cue, rather than a passive behavioral feature, and fuses gaze estimation with sound source localization. A spatiotemporal attention mechanism handles the multimodal sequence modeling. Contribution/Results: We introduce a privacy-preserving single-user gaze baseline paradigm that achieves state-of-the-art (SOTA) performance using gaze data from only one participant; extending it to multiple users further captures dynamic interaction patterns. Evaluated on a real-world three-party conversational dataset, our method improves F1 score by 12.3% and achieves end-to-end latency under 150 ms, satisfying stringent real-time hearing-aid requirements.

📝 Abstract
Turn-taking prediction is crucial for seamless interactions. This study introduces a novel, lightweight framework for accurate turn-taking prediction in triadic conversations without relying on computationally intensive methods. Unlike prior approaches that either disregard gaze or treat it as a passive signal, our model integrates gaze with speaker localization, structuring it within a spatial constraint to transform it into a reliable predictive cue. Leveraging egocentric behavioral cues, our experiments demonstrate that incorporating gaze data from a single user significantly improves prediction performance, while gaze data from multiple users further enhances it by capturing richer conversational dynamics. This study presents a lightweight and privacy-conscious approach to support adaptive, directional sound control, enhancing speech intelligibility in noisy environments, particularly for hearing assistance in smart glasses.
Problem

Research questions and friction points this paper is trying to address.

Predicting turn-taking in triadic conversations using gaze
Integrating gaze with speaker localization for better accuracy
Enhancing speech intelligibility in noisy environments via smart glasses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight framework for triadic turn-taking prediction
Integrates gaze with speaker localization under a spatial constraint
Uses egocentric cues for adaptive sound control
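
One way to picture a spatial constraint on gaze is sketched below. This is an illustrative assumption, not the paper's implementation: the summary does not specify how gaze and speaker localization are combined. The function names `angular_diff` and `gaze_target`, the azimuth-only geometry, and the 15-degree tolerance are all hypothetical. The idea is that a gaze sample only counts as a cue when it falls close to a localized speaker direction.

```python
def angular_diff(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def gaze_target(gaze_azimuth_deg, speaker_azimuths_deg, tolerance_deg=15.0):
    """Map a gaze direction to the nearest localized speaker.

    Returns the index of the closest speaker whose azimuth lies within
    tolerance_deg of the gaze direction, or None when the gaze does not
    fall on any speaker (hypothetical constraint, for illustration only).
    """
    best_idx, best_diff = None, tolerance_deg
    for idx, azimuth in enumerate(speaker_azimuths_deg):
        diff = angular_diff(gaze_azimuth_deg, azimuth)
        if diff <= best_diff:
            best_idx, best_diff = idx, diff
    return best_idx


# Two conversation partners localized at -30 and +30 degrees
# relative to the wearer's head orientation.
speakers = [-30.0, 30.0]
looking_at = gaze_target(28.0, speakers)    # gaze near the second partner
looking_away = gaze_target(0.0, speakers)   # gaze between the partners
```

In this sketch, gaze samples that land between speakers are discarded rather than forced onto the nearest one, which is one simple reading of treating gaze as a constrained cue instead of a raw signal.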
Seongsil Heo
University of California, Santa Cruz, USA
Calvin Murdock
Reality Labs Research at Meta
computer vision, machine learning
Michael Proulx
Meta Reality Labs Research, USA
Christi Miller
Meta Reality Labs Research, USA