RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph

📅 2025-08-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of efficiently retrieving target objects in cluttered, partially observable environments using a single wrist-mounted RGB-D camera guided by natural language instructions. The authors propose a dynamic hierarchical scene graph modeling framework that tightly integrates active perception, interactive perception, and manipulation. The key contributions are: (1) a task-aware dynamic scene graph unifying semantic, geometric, and spatial-functional relationships among objects; (2) a large vision-language model-driven 6-DoF camera pose inference mechanism enabling human-like active viewpoint planning and interactive exploration; and (3) robust object retrieval under multi-person intervention and severe occlusion. Experiments demonstrate that the method significantly reduces reliance on multiple fixed cameras or globally visible scenes, achieving high retrieval accuracy and strong adaptability in complex real-world environments.

πŸ“ Abstract
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction-with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present extbf{RoboRetriever}, a novel framework for real-world object retrieval that operates using only a extbf{single} wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a extbf{dynamic hierarchical scene graph} that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and task instruction to infer the target object and coordinate an integrated action module combining extbf{active perception}, extbf{interactive perception}, and extbf{manipulation}. To enable task-aware scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and geometry scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
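The abstract describes a dynamic hierarchical scene graph that accumulates object semantics, geometry, and inter-object relations across observations. The paper does not publish its data structures, but the idea can be illustrated with a minimal sketch; all class and field names below (`ObjectNode`, `DynamicSceneGraph`, `last_seen`, etc.) are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """One object in the graph (hypothetical structure)."""
    name: str                     # semantic label, e.g. "mug"
    pose: tuple                   # (x, y, z) centroid from RGB-D, world frame
    last_seen: int = 0            # frame index of the last observation
    relations: dict = field(default_factory=dict)  # e.g. {"on": "table"}

class DynamicSceneGraph:
    """Minimal sketch of a scene graph updated per observation."""
    def __init__(self):
        self.nodes: dict[str, ObjectNode] = {}

    def update(self, detections, frame_idx):
        # Merge new detections into memory. Existing nodes are refreshed;
        # unseen nodes keep their last known state (partial observability).
        for name, pose, relations in detections:
            node = self.nodes.setdefault(name, ObjectNode(name, pose))
            node.pose = pose
            node.relations.update(relations)
            node.last_seen = frame_idx

    def query(self, relation, target):
        # Names of all objects holding `relation` to `target`.
        return [n.name for n in self.nodes.values()
                if n.relations.get(relation) == target]
```

Keeping stale nodes in memory is what lets a single moving camera reason about objects that are currently out of view.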
Problem

Research questions and friction points this paper is trying to address.

Single-camera robot object retrieval in cluttered environments
Dynamic scene graph for object semantics and relations
Active and interactive perception with natural language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single wrist-mounted RGB-D camera operation
Dynamic hierarchical scene graph encoding
Active and interactive perception integration
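The framework's supervisor coordinates three action modes: active perception (moving the wrist camera), interactive perception (physically disturbing the scene), and manipulation. The paper does not give the supervisor's decision logic; the sketch below is one plausible dispatch rule under assumed predicates (`visible`, `reachable`) and assumed action labels, purely for illustration.

```python
def retrieval_step(graph_names, target, visible, reachable):
    """Pick the next action mode (hypothetical supervisor logic):
    replan the camera if the target is unknown, interact if it is
    known but occluded, and grasp once it is visible and reachable."""
    if target not in graph_names:
        return "active_perception"       # infer a new 6-DoF camera pose
    if not visible(target):
        return "interactive_perception"  # push or remove occluders
    if reachable(target):
        return "manipulate"              # grasp and retrieve the target
    return "active_perception"           # reposition for a better approach
```

In the real system each branch would be grounded in the scene graph and a VLM query rather than boolean flags, but the loop structure is the same: perceive, update memory, act, repeat.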
Hecheng Wang
College for Elite Engineer, Fudan University
Jiankun Ren
College for Elite Engineer, Fudan University
Jia Yu
Co-founder, Wherobots Inc.; Assistant Professor of Computer Science, Washington State University
Database systems · Data management · Geospatial databases · GIS
Lizhe Qi
College for Elite Engineer, Fudan University
Yunquan Sun
College for Elite Engineer, Fudan University