Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

📅 2025-10-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates how egocentric versus exocentric (allocentric) spatial representations (2D vs. 3D; self-centered vs. scene-centered) affect grounding in multimodal referential communication. To address this, the authors introduce a multimodal benchmark dataset that synchronously captures first-person gaze, speech, and video (via Meta Project Aria glasses), third-person video (fixed cameras), and 3D scene reconstructions. The method integrates eye tracking with multi-view vision, using SLAM and multi-view stereo to construct a unified 3D spatial reference frame that enables cross-perspective referential resolution. The dataset comprises 3.67 hours of naturalistic dialogues with 2,707 fine-grained annotations of referring expressions, establishing a reproducible benchmark and methodological framework for embodied agents to achieve viewpoint alignment and contextualized referential understanding in real-world environments.
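The cross-perspective resolution described above can be pictured as two steps: a SLAM-estimated camera pose lifts an egocentric gaze ray into the shared world frame, and the referent is then taken to be the reconstructed object closest to that ray. The following sketch illustrates that idea only; the function names, pose convention (camera-to-world rotation `R_wc` and translation `t_wc`), and nearest-object heuristic are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def gaze_to_world_ray(R_wc, t_wc, gaze_dir_cam):
    """Lift a gaze direction from camera coordinates into the world frame
    using a camera-to-world pose (e.g., estimated by SLAM)."""
    d = gaze_dir_cam / np.linalg.norm(gaze_dir_cam)
    return t_wc, R_wc @ d  # ray origin (camera center) and unit direction

def nearest_object(origin, direction, object_centers):
    """Pick the object whose 3D center lies closest to the gaze ray
    (perpendicular point-to-ray distance, ignoring objects behind the wearer)."""
    v = object_centers - origin          # vectors from ray origin to each center
    proj = np.clip(v @ direction, 0.0, None)  # scalar projection onto the ray
    closest = origin + np.outer(proj, direction)
    dists = np.linalg.norm(object_centers - closest, axis=1)
    return int(np.argmin(dists)), float(dists.min())

# Toy example: identity pose, gaze straight ahead (+z), two candidate objects.
origin, d = gaze_to_world_ray(np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0]))
idx, dist = nearest_object(origin, d, np.array([[0.0, 0.0, 2.0],
                                                [1.0, 0.0, 2.0]]))
```

In this toy setup the first object sits exactly on the gaze ray, so it is selected with zero perpendicular distance.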

๐Ÿ“ Abstract
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
Problem

Research questions and friction points this paper is trying to address.

Studying referential communication across egocentric and exocentric perspectives
Evaluating how spatial representations affect multimodal grounding
Advancing embodied agents for situated dialogue understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset combines egocentric and exocentric multimodal recordings
Uses synchronized gaze, speech, and video with 3D reconstructions
Benchmark for evaluating spatial representations in multimodal grounding
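Synchronizing the gaze, speech, and video streams listed above typically reduces to nearest-timestamp matching on a shared clock. A minimal sketch of that alignment step follows; the function name, array layout, and the assumption of sorted timestamps in seconds are hypothetical, not details from the dataset's release.

```python
import numpy as np

def align_streams(gaze_ts, frame_ts):
    """For each video-frame timestamp, return the index of the nearest gaze
    sample. Assumes both timestamp arrays are sorted and on a shared clock."""
    idx = np.searchsorted(gaze_ts, frame_ts)      # insertion points
    idx = np.clip(idx, 1, len(gaze_ts) - 1)       # keep a valid left neighbor
    left, right = gaze_ts[idx - 1], gaze_ts[idx]
    choose_left = (frame_ts - left) < (right - frame_ts)
    return np.where(choose_left, idx - 1, idx)

# Toy example: 100 Hz gaze samples matched to two video frames.
gaze_ts = np.array([0.00, 0.01, 0.02, 0.03])
frame_ts = np.array([0.000, 0.021])
matched = align_streams(gaze_ts, frame_ts)
```

Here the first frame matches the gaze sample at 0.00 s and the second matches the sample at 0.02 s, the nearest neighbors in time.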