Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

📅 2024-05-28
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
This paper introduces “human-intent-driven 3D object detection,” a novel task wherein AI localizes target objects in RGB-D scenes solely from natural-language intents (e.g., “I need something to support my back”), without explicit object references. Methodologically, it (1) formally defines the 3D intent grounding problem; (2) introduces Intent3D, the first large-scale benchmark comprising 44,990 intent utterances annotated across 209 fine-grained object categories; and (3) proposes IntentNet—a multimodal architecture integrating language-3D alignment, vision-language joint embedding, and cascaded adaptive optimization. Evaluated on Intent3D, IntentNet significantly outperforms multiple strong baselines—including state-of-the-art vision-language and 3D detection models—demonstrating effective cross-semantic-level reasoning from abstract linguistic intent to concrete physical objects. The results validate both the feasibility of intent-driven 3D perception and the generalizability of the proposed framework to real-world, reference-free interaction scenarios.

📝 Abstract
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". Instead, 3D intention grounding challenges AI agents to automatically observe, reason and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines based on different language-based 3D object detection models on our benchmark. Finally, we propose IntentNet, our unique approach, designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multiple objective optimization. Project Page: https://weitaikang.github.io/Intent3D-webpage/
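The abstract mentions cascaded adaptive learning, which exploits a priority ordering among losses when optimizing multiple objectives. The paper's exact formulation is not given here; the sketch below is an illustrative, hypothetical take on the idea, where each lower-priority loss is attenuated while higher-priority losses remain large (all names and the gating function are assumptions, not the authors' implementation).

```python
# Illustrative sketch only -- NOT the paper's actual scheme. Shows one way
# to combine prioritized losses so training focuses on higher-priority
# objectives first. Function and loss names are hypothetical.

def cascaded_loss(losses, priorities):
    """Sum losses in priority order, scaling each by a gate derived from
    the preceding (higher-priority) loss values."""
    total = 0.0
    gate = 1.0
    for name in priorities:
        value = losses[name]
        total += gate * value
        # Attenuate the next (lower-priority) term while this loss is
        # still large; 1/(1+value) is one simple monotone gate.
        gate *= 1.0 / (1.0 + value)
    return total

# Example ordering: intention understanding first, then candidate
# classification, then box regression.
losses = {"intent_align": 2.0, "candidate_cls": 1.0, "box_reg": 0.5}
total = cascaded_loss(losses, ["intent_align", "candidate_cls", "box_reg"])
```

As the higher-priority losses shrink during training, their gates approach 1 and the lower-priority terms regain full weight, giving an implicit curriculum over the objectives.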
Problem

Research questions and friction points this paper is trying to address.

Detecting 3D objects from abstract human intentions rather than explicit object references
Enabling AI agents to automatically observe and reason over 3D scenes
The lack of a dedicated benchmark and model for intention-based detection (motivating Intent3D and IntentNet)
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D intention grounding
IntentNet approach
cascaded adaptive learning