🤖 AI Summary
This paper addresses the challenge of efficiently retrieving target objects in cluttered, partially observable environments using a single wrist-mounted RGB-D camera guided by natural language instructions. We propose a dynamic hierarchical scene graph modeling framework that tightly integrates active perception, interactive perception, and manipulation. Our key contributions are: (1) a task-aware dynamic scene graph unifying semantic, geometric, and spatial-functional relationships among objects; (2) a 6-DoF camera pose inference mechanism, driven by a large vision-language model, that enables human-like active viewpoint planning and interactive exploration; and (3) robust object retrieval under multi-person intervention and severe occlusion. Experiments demonstrate that our method significantly reduces reliance on multiple fixed cameras or globally visible scenes, achieving high retrieval accuracy and strong adaptability in complex real-world environments.
📄 Abstract
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, using only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present **RoboRetriever**, a novel framework for real-world object retrieval that operates using only a **single** wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a **dynamic hierarchical scene graph** that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and the task instruction to infer the target object and coordinate an integrated action module combining **active perception**, **interactive perception**, and **manipulation**. To enable task-aware, scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and the geometric scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
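To make the idea of a dynamic scene graph concrete, the sketch below shows one minimal way such a memory could be structured: object nodes carrying semantic, geometric, and temporal fields, plus directed relation triples that are refreshed as new observations arrive. This is an illustration only, not the paper's implementation; all class names, fields, and the timestamp-based update rule are hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class ObjectNode:
    """One object in the scene graph: semantic label, coarse geometry, timestamp."""
    name: str          # semantic label, e.g. "mug"
    centroid: tuple    # (x, y, z) position in the robot frame (hypothetical convention)
    bbox_size: tuple   # (w, h, d) axis-aligned extent
    last_seen: int     # frame index of the most recent observation

class SceneGraph:
    """Dynamic scene graph: object nodes plus directed spatial relations."""

    def __init__(self):
        self.nodes = {}        # name -> ObjectNode
        self.relations = set() # (subject, predicate, object) triples

    def update_object(self, node):
        # Insert or refresh a node; a newer observation overwrites an older one,
        # which is how the graph stays current as the scene changes over time.
        old = self.nodes.get(node.name)
        if old is None or node.last_seen >= old.last_seen:
            self.nodes[node.name] = node

    def add_relation(self, subj, pred, obj):
        # Only relate objects the graph already knows about.
        if subj in self.nodes and obj in self.nodes:
            self.relations.add((subj, pred, obj))

    def query(self, pred=None):
        # Return all relation triples, optionally filtered by predicate;
        # a supervisor module could reason over these to pick a target.
        return [r for r in self.relations if pred is None or r[1] == pred]

# Example: a box observed on top of a mug, occluding it.
g = SceneGraph()
g.update_object(ObjectNode("mug", (0.4, 0.1, 0.02), (0.08, 0.10, 0.08), last_seen=3))
g.update_object(ObjectNode("box", (0.4, 0.1, 0.15), (0.20, 0.20, 0.10), last_seen=5))
g.add_relation("box", "on_top_of", "mug")
print(g.query("on_top_of"))  # [('box', 'on_top_of', 'mug')]
```

Under this sketch, an occlusion relation such as `("box", "on_top_of", "mug")` is the kind of cue that would trigger interactive perception (move the box) before grasping the target.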