🤖 AI Summary
Addressing key challenges in whole-body mobile manipulation—including complex perception modeling, discontinuous motion generation, and poor cross-environment generalization—this paper introduces DSPv2, a novel dense policy architecture. DSPv2 pioneers the intra-policy deep fusion of 3D spatial features with multi-view 2D semantic features, incorporating a 3D–2D feature alignment encoding mechanism and a multimodal perception fusion module to enable end-to-end whole-body coordination. Compared to state-of-the-art imitation learning approaches, DSPv2 achieves a +21.3% improvement in task success rate and significantly enhances cross-scenario generalization. It demonstrates superior robustness and practicality across diverse real-world environments and complex manipulation tasks. This work establishes a new paradigm for embodied agents to perform high-precision, adaptive whole-body manipulation in open, unstructured settings.
📝 Abstract
Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observations, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for a whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability. Project page is available at: https://selen-suyue.github.io/DSPv2Net/.
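The paper does not release the alignment details here, but the core idea of pairing 3D spatial features with multi-view 2D semantic features can be sketched as follows. This is a minimal, hypothetical illustration (not DSPv2's actual implementation): each 3D point is projected into every camera view with a pinhole intrinsic matrix, the 2D feature map is sampled at the projected pixel, the per-view samples are averaged, and the result is concatenated with the point's 3D feature. The function names, the nearest-neighbor sampling, and the mean pooling across views are all assumptions made for clarity.

```python
import numpy as np

def project_points(points, K):
    # Pinhole projection: map Nx3 camera-frame points to pixel coordinates.
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def align_3d_2d(points, feat3d, feat2d_views, Ks):
    """Hypothetical 3D-2D alignment sketch: for each 3D point, gather a
    2D semantic feature from every view at its projected pixel (nearest
    neighbor), average across views, and concatenate with the point's
    3D spatial feature. Not the paper's actual mechanism."""
    gathered = []
    for feat2d, K in zip(feat2d_views, Ks):
        H, W, _ = feat2d.shape
        uv = project_points(points, K)
        # Round to the nearest pixel and clamp to the feature-map bounds.
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
        v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
        gathered.append(feat2d[v, u])          # (N, C) per view
    sem = np.mean(gathered, axis=0)            # pool 2D semantics across views
    return np.concatenate([feat3d, sem], axis=1)  # (N, C3 + C2)
```

In practice a learned policy would replace the nearest-neighbor lookup with differentiable (e.g. bilinear) sampling and fuse the two streams with a learned module rather than plain concatenation; the sketch only shows where the geometric correspondence between 3D points and 2D feature maps comes from.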