DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

πŸ“… 2024-11-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 8
✨ Influential: 0
πŸ€– AI Summary
To address the limitations imposed by the static-environment assumption in open-vocabulary mobile manipulation, this paper proposes DynaMem, an online dynamic spatio-semantic memory framework for real-world changing scenes. The method centers on an incremental 3D point-cloud memory that supports real-time insertion, deletion, update, and query of objects as they move, are occluded, or enter and leave the scene. Natural-language localization queries are answered either with multimodal large language models (MLLMs) or with open-vocabulary vision-language features (e.g., CLIP/SigLIP). The system runs in real time on the Stretch SE3 robot platform. Evaluated in three real-world and nine offline dynamic scenes, it achieves a 70% average pick-and-drop success rate on non-stationary objects, more than double that of the best static-memory baseline. Code, demonstration videos, and deployment documentation are publicly released.
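The summary's core idea, a point-cloud memory supporting insertion, deletion, and semantic query, can be sketched as a voxel map keyed by discretized coordinates. This is a minimal illustration under assumed details (voxel hashing, per-voxel feature overwrite, cosine-similarity lookup), not the authors' implementation:

```python
import numpy as np

class DynamicVoxelMemory:
    """Hypothetical sketch of a dynamic spatio-semantic memory.
    Each occupied voxel stores its last observation time and a
    semantic feature vector (e.g., a CLIP-style image embedding)."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxels = {}  # (i, j, k) -> {"t": float, "feat": np.ndarray}

    def _key(self, point):
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def insert(self, points, feats, t):
        # Add or overwrite voxels observed in the current frame.
        for p, f in zip(points, feats):
            self.voxels[self._key(p)] = {"t": t,
                                         "feat": np.asarray(f, dtype=np.float32)}

    def remove_stale(self, observed_empty_points, t):
        # Delete voxels the current depth frame sees as free space:
        # this is how moved or removed objects leave the memory.
        for p in observed_empty_points:
            self.voxels.pop(self._key(p), None)

    def query(self, text_feat):
        # Return the center of the voxel whose feature best matches
        # the (normalized) text embedding of the query.
        if not self.voxels:
            return None
        keys = list(self.voxels)
        feats = np.stack([self.voxels[k]["feat"] for k in keys])
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        q = np.asarray(text_feat, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-8)
        best = keys[int(np.argmax(feats @ q))]
        return (np.asarray(best) + 0.5) * self.voxel_size
```

In a real system the feature vectors would come from a vision-language model and the "observed empty" set from ray-casting the depth image; here both are left abstract to keep the data-structure logic visible.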

πŸ“ Abstract
Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/
Problem

Research questions and friction points this paper is trying to address.

Dynamic environment handling in mobile manipulation tasks
Open-vocabulary object localization in changing scenes
Continuous memory updates for object movement and appearance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic spatio-semantic memory for environment representation
3D data structure for dynamic point cloud memory
Multimodal LLMs for open-vocabulary object localization
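The abstract also notes that the robot can "search for objects not found in memory." A hypothetical decision rule in that spirit, with an assumed similarity threshold and a robot at the origin (neither specified in this summary), might look like:

```python
import numpy as np

def localize_or_explore(voxel_feats, voxel_centers, frontier_centers,
                        text_feat, sim_threshold=0.25):
    """Hypothetical sketch: localize the queried object if some memory
    feature matches the text embedding confidently enough; otherwise
    return an unexplored frontier to drive toward."""
    q = np.asarray(text_feat, dtype=np.float32)
    q = q / (np.linalg.norm(q) + 1e-8)
    if len(voxel_feats):
        feats = np.asarray(voxel_feats, dtype=np.float32)
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        sims = feats @ q
        best = int(np.argmax(sims))
        if sims[best] >= sim_threshold:
            return "navigate", np.asarray(voxel_centers[best])
    # No confident match in memory: head to the nearest frontier
    # (distances measured from an assumed robot position at the origin).
    dists = np.linalg.norm(np.asarray(frontier_centers, dtype=np.float32), axis=1)
    return "explore", np.asarray(frontier_centers)[int(np.argmin(dists))]
```

The threshold value and the frontier-selection heuristic are illustrative assumptions; the paper's actual policy may differ.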