Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses language-guided object localization in dynamic scenes captured by event cameras. To handle the asynchronous, sparse, and high-temporal-resolution nature of event data, we propose the first dedicated benchmark and an accompanying grounding framework, EventRefer. Each referring expression is annotated along four grounding attributes (appearance, status, relation to viewer, and relation to other objects), spanning spatial, temporal, semantic, and relational cues, and an attribute-aware Mixture of Event-Attribute Experts (MoEE) dynamically fuses the resulting multi-attribute representations from event streams and RGB frames. The approach supports unimodal (event-only or frame-only) and multimodal joint inputs, and achieves state-of-the-art localization accuracy in all three settings (event-only, frame-only, and event-frame fusion) on a large-scale real-world driving dataset. The study establishes a systematic paradigm for event-driven language-vision alignment, advancing temporal perception and semantic understanding for autonomous driving and embodied intelligence.
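To make the data modality concrete: an event camera emits a stream of (x, y, t, polarity) tuples rather than frames. Below is a minimal sketch of one common way to densify such a stream into a fixed-size voxel grid for a learned model; the function name, binning scheme, and signed-polarity encoding are illustrative assumptions, not the representation specified by Talk2Event.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an asynchronous event stream into a (num_bins, H, W) grid.

    `events` is an (N, 4) array of (x, y, t, polarity) tuples with valid
    pixel coordinates. This is a common input encoding for event-based
    deep models, not necessarily the one used by Talk2Event/EventRefer.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps into [0, num_bins - 1] and bucket each event.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = t_norm.astype(int)
    # Signed accumulation: +1 for ON events, -1 for OFF events.
    np.add.at(grid, (bins, y, x), np.where(p > 0, 1.0, -1.0))
    return grid
```

Signed accumulation keeps the ON/OFF polarity information, while the temporal bins preserve some of the stream's microsecond-scale timing that frame-based inputs discard.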

📝 Abstract
Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
Problem

Research questions and friction points this paper is trying to address.

Connecting asynchronous event camera streams to human language.
Bridging spatial, temporal, and relational reasoning in event-based perception.
Advancing multimodal, language-driven perception for real-world robotics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

EventRefer, an attribute-aware grounding framework that fuses multi-attribute representations.
A Mixture of Event-Attribute Experts (MoEE) that dynamically weights attribute experts across modalities and scene dynamics (see the sketch below).
A large-scale benchmark with over 30,000 validated referring expressions.
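The MoEE bullet above describes a gated fusion of per-attribute experts. A hypothetical minimal sketch of that mixture-of-experts pattern in PyTorch follows; the class name, expert/gate shapes, and gating input are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MoEEGate(nn.Module):
    """Illustrative Mixture of Event-Attribute Experts.

    One expert per grounding attribute (appearance, status, relation to
    viewer, relation to other objects); a softmax gate predicts per-sample
    expert weights from the fused input features.
    """
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        # x: (batch, dim) fused event/frame/text features.
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (B, dim)
```

For example, `MoEEGate(dim=256)` applied to a `(batch, 256)` feature tensor returns a fused `(batch, 256)` representation in which each sample's expert weighting can adapt to the modality and scene dynamics at hand.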