🤖 AI Summary
This paper addresses the challenge of efficiently retrieving target objects in cluttered, partially observable environments using a single wrist-mounted RGB-D camera guided by natural language instructions. We propose a dynamic hierarchical scene graph modeling framework that tightly integrates active perception, interactive perception, and manipulation. Our key contributions are: (1) a task-aware dynamic scene graph unifying semantic, geometric, and spatial-functional relationships among objects; (2) a 6-DoF camera pose inference mechanism, driven by a large vision-language model, that enables human-like active viewpoint planning and interactive exploration; and (3) robust object retrieval under multi-person intervention and severe occlusion. Experiments demonstrate that our method significantly reduces reliance on multiple fixed cameras or globally visible scenes, achieving high retrieval accuracy and strong adaptability in complex real-world environments.
📄 Abstract
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, using only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present **RoboRetriever**, a novel framework for real-world object retrieval that operates using only a **single** wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a **dynamic hierarchical scene graph** that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and the task instruction to infer the target object and coordinate an integrated action module combining **active perception**, **interactive perception**, and **manipulation**. To enable task-aware, scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and the geometric scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
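To make the idea of a dynamic scene graph concrete, the sketch below shows one minimal way such a memory could be structured: object nodes carrying semantic, geometric, and temporal fields, plus directed relation triples that are refreshed as new observations arrive. This is an illustration only, not the paper's implementation; all class names, fields, and the timestamp-based update rule are hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class ObjectNode:
    """One object in the scene graph: semantic label, coarse geometry, timestamp."""
    name: str          # semantic label, e.g. "mug"
    centroid: tuple    # (x, y, z) position in the robot frame (hypothetical convention)
    bbox_size: tuple   # (w, h, d) axis-aligned extent
    last_seen: int     # frame index of the most recent observation

class SceneGraph:
    """Dynamic scene graph: object nodes plus directed spatial relations."""

    def __init__(self):
        self.nodes = {}        # name -> ObjectNode
        self.relations = set() # (subject, predicate, object) triples

    def update_object(self, node):
        # Insert or refresh a node; a newer observation overwrites an older one,
        # which is how the graph stays current as the scene changes over time.
        old = self.nodes.get(node.name)
        if old is None or node.last_seen >= old.last_seen:
            self.nodes[node.name] = node

    def add_relation(self, subj, pred, obj):
        # Only relate objects the graph already knows about.
        if subj in self.nodes and obj in self.nodes:
            self.relations.add((subj, pred, obj))

    def query(self, pred=None):
        # Return all relation triples, optionally filtered by predicate;
        # a supervisor module could reason over these to pick a target.
        return [r for r in self.relations if pred is None or r[1] == pred]

# Example: a box observed on top of a mug, occluding it.
g = SceneGraph()
g.update_object(ObjectNode("mug", (0.4, 0.1, 0.02), (0.08, 0.10, 0.08), last_seen=3))
g.update_object(ObjectNode("box", (0.4, 0.1, 0.15), (0.20, 0.20, 0.10), last_seen=5))
g.add_relation("box", "on_top_of", "mug")
print(g.query("on_top_of"))  # [('box', 'on_top_of', 'mug')]
```

Under this sketch, an occlusion relation such as `("box", "on_top_of", "mug")` is the kind of cue that would trigger interactive perception (move the box) before grasping the target.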