Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

📅 2025-10-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates how egocentric versus exocentric (allocentric) spatial representations (2D vs. 3D; self-centered vs. scene-centered) affect grounding in multimodal referential communication. To address this, the authors introduce a multimodal benchmark dataset that synchronously captures first-person gaze, speech, and video (via Meta Project Aria glasses), third-person video (fixed cameras), and 3D scene reconstructions. The method integrates eye tracking with multi-view vision, using SLAM and multi-view stereo to construct a unified 3D spatial reference frame that enables cross-perspective referential resolution. The dataset comprises 3.67 hours of naturalistic dialogues with 2,707 fine-grained annotations of referring expressions, establishing a reproducible benchmark and methodological framework for embodied agents to achieve viewpoint alignment and contextualized referential understanding in real-world environments.
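The cross-perspective resolution described above can be pictured as two steps: a SLAM-estimated camera pose lifts an egocentric gaze ray into the shared world frame, and the referent is then taken to be the reconstructed object closest to that ray. The following sketch illustrates that idea only; the function names, pose convention (camera-to-world rotation `R_wc` and translation `t_wc`), and nearest-object heuristic are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def gaze_to_world_ray(R_wc, t_wc, gaze_dir_cam):
    """Lift a gaze direction from camera coordinates into the world frame
    using a camera-to-world pose (e.g., estimated by SLAM)."""
    d = gaze_dir_cam / np.linalg.norm(gaze_dir_cam)
    return t_wc, R_wc @ d  # ray origin (camera center) and unit direction

def nearest_object(origin, direction, object_centers):
    """Pick the object whose 3D center lies closest to the gaze ray
    (perpendicular point-to-ray distance, ignoring objects behind the wearer)."""
    v = object_centers - origin          # vectors from ray origin to each center
    proj = np.clip(v @ direction, 0.0, None)  # scalar projection onto the ray
    closest = origin + np.outer(proj, direction)
    dists = np.linalg.norm(object_centers - closest, axis=1)
    return int(np.argmin(dists)), float(dists.min())

# Toy example: identity pose, gaze straight ahead (+z), two candidate objects.
origin, d = gaze_to_world_ray(np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0]))
idx, dist = nearest_object(origin, d, np.array([[0.0, 0.0, 2.0],
                                                [1.0, 0.0, 2.0]]))
```

In this toy setup the first object sits exactly on the gaze ray, so it is selected with zero perpendicular distance.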

๐Ÿ“ Abstract
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
Problem

Research questions and friction points this paper is trying to address.

Studying referential communication across egocentric and exocentric perspectives
Evaluating how spatial representations affect multimodal grounding
Advancing embodied agents for situated dialogue understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset combines egocentric and exocentric multimodal recordings
Uses synchronized gaze, speech, and video with 3D reconstructions
Benchmark for evaluating spatial representations in multimodal grounding
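Synchronizing the gaze, speech, and video streams listed above typically reduces to nearest-timestamp matching on a shared clock. A minimal sketch of that alignment step follows; the function name, array layout, and the assumption of sorted timestamps in seconds are hypothetical, not details from the dataset's release.

```python
import numpy as np

def align_streams(gaze_ts, frame_ts):
    """For each video-frame timestamp, return the index of the nearest gaze
    sample. Assumes both timestamp arrays are sorted and on a shared clock."""
    idx = np.searchsorted(gaze_ts, frame_ts)      # insertion points
    idx = np.clip(idx, 1, len(gaze_ts) - 1)       # keep a valid left neighbor
    left, right = gaze_ts[idx - 1], gaze_ts[idx]
    choose_left = (frame_ts - left) < (right - frame_ts)
    return np.where(choose_left, idx - 1, idx)

# Toy example: 100 Hz gaze samples matched to two video frames.
gaze_ts = np.array([0.00, 0.01, 0.02, 0.03])
frame_ts = np.array([0.000, 0.021])
matched = align_streams(gaze_ts, frame_ts)
```

Here the first frame matches the gaze sample at 0.00 s and the second matches the sample at 0.02 s, the nearest neighbors in time.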