Visual Intention Grounding for Egocentric Assistants

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses implicit-intent-driven visual grounding for first-person (egocentric) AI assistants: localizing functionally relevant but unmentioned objects in egocentric images when no explicit object query is given. The authors first introduce EgoIntention, the first egocentric dataset designed for visual intention grounding. They then propose the Reason-to-Ground (RoG) instruction-tuning paradigm, which models functional affordances, performs chained multi-step reasoning, and trains on a hybrid of explicit object descriptions and implicit intentions to co-optimize intention understanding and object localization. Experiments show that RoG significantly outperforms baseline methods on EgoIntention, grounding intention-relevant objects while suppressing contextual distractors, and that it maintains or even slightly improves performance on conventional named-object grounding.
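
The summary above describes RoG only at a high level. As a rough illustration, the sketch below shows one plausible way a hybrid instruction-tuning sample could be assembled: an implicit-intention query paired with a chained "reason, then ground" target, mixed in the same batch with ordinary named-object samples. All field names, prompt wording, and the bounding-box format are assumptions for illustration, not the authors' actual data schema.

```python
# Minimal sketch of hybrid instruction-tuning data construction for a
# Reason-to-Ground (RoG)-style setup. Field names, prompt wording, and the
# bbox format are illustrative assumptions, not the paper's actual schema.
import random
from dataclasses import dataclass

@dataclass
class Example:
    image: str          # path to an egocentric (or exocentric) frame
    query: str          # explicit object description OR implicit intention
    is_intention: bool  # True if the query states a need rather than a name
    rationale: str      # short affordance reasoning ("a mug can hold water")
    bbox: list          # target box as [x1, y1, x2, y2] in pixels

def to_chat_sample(ex: Example) -> dict:
    """Build one chat-style training sample with a chained target:
    first the intention reasoning, then the grounded box."""
    if ex.is_intention:
        prompt = (f"I want to {ex.query}. Which object in this egocentric "
                  f"image should I use? Explain, then give its bounding box.")
        answer = f"{ex.rationale} Bounding box: {ex.bbox}"
    else:
        # Explicit named-object query: no reasoning chain in the target.
        prompt = f"Locate the {ex.query}. Give its bounding box."
        answer = f"Bounding box: {ex.bbox}"
    return {"image": ex.image,
            "conversations": [{"role": "user", "content": prompt},
                              {"role": "assistant", "content": answer}]}

def build_hybrid_batch(intent_pool, description_pool, size=16, intent_ratio=0.5):
    """Mix implicit-intention and explicit-description samples in one batch."""
    n_intent = int(size * intent_ratio)
    batch = random.sample(intent_pool, n_intent) + \
            random.sample(description_pool, size - n_intent)
    random.shuffle(batch)
    return [to_chat_sample(ex) for ex in batch]
```

In this sketch, the intent_ratio parameter is one simple way to realize the hybrid training the summary mentions: every batch contains both chained intention-reasoning targets and plain named-object grounding targets.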

📝 Abstract
Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts -- inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training on normal descriptions and egocentric intentions through a chained intention-reasoning and object-grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.
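
To make the chained reasoning-then-grounding output concrete, the snippet below parses a reason-then-box style model response and scores it with intersection-over-union against a ground-truth box. The response format, the regular expression, and the 0.5 IoU threshold are illustrative assumptions and do not reproduce the EgoIntention benchmark's actual evaluation protocol.

```python
# Illustrative sketch of evaluating a chained "reason, then ground" response.
# The expected response format and the IoU threshold are assumptions; they do
# not reproduce the EgoIntention benchmark's actual protocol.
import re

def parse_box(response: str):
    """Pull the last [x1, y1, x2, y2] box out of a free-form model response."""
    boxes = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]",
                       response)
    return [int(v) for v in boxes[-1]] if boxes else None

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# Example: the model first reasons about the intention, then emits a box.
response = ("To drink water, the mug on the counter is the most suitable "
            "object. Bounding box: [412, 233, 506, 340]")
pred = parse_box(response)
correct = pred is not None and iou(pred, [405, 230, 510, 345]) >= 0.5
print(pred, correct)
```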
Problem

Research questions and friction points this paper is trying to address.

Bridging egocentric and exocentric visual grounding gaps
Understanding implicit human intentions in egocentric views
Improving affordance reasoning for uncommon object functionalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoIntention dataset for egocentric grounding
Proposes Reason-to-Ground instruction tuning method
Enables hybrid training with intention reasoning