🤖 AI Summary
This work addresses the spatial mismatch between 2D image observations and 3D action outputs in visual imitation learning by proposing a policy architecture that reasons explicitly in a voxelized 3D representation. The method lifts 2D image features into a voxelized 3D space via cross-attention, employs a learnable module to select task-relevant voxels and compress them into a compact set of spatial tokens, and then uses a multi-token decoder that conditions on the full token set to predict actions. This design circumvents the geometric information loss inherent in conventional feature aggregation schemes, substantially enhancing spatial reasoning and robustness. The approach achieves a state-of-the-art average success rate of 88.8% on the LIBERO benchmark, outperforming the strongest baseline by 14.8%, delivers large gains over prior methods on ManiSkill and LIBERO-Plus, and generalizes to novel layouts, viewpoints, and backgrounds in real-world experiments.
📝 Abstract
Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, introducing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a volumetric representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding the lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
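To make the three-stage pipeline concrete, the following is a minimal NumPy sketch of the ideas described above: cross-attention lifting of 2D features into a voxel grid, top-K selection of task-relevant voxels into compact spatial tokens, and a decoder that keeps all tokens rather than pooling them. All shapes, the scoring function, and the variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32             # feature dimension (assumed)
N_PIX = 64         # number of 2D image feature tokens (assumed)
N_VOX = 8 * 8 * 8  # voxels in the 3D grid (assumed 8^3 for illustration)
K = 16             # compact spatial tokens kept after selection (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Step 1: lift 2D image features into the voxel grid via cross-attention.
# Each voxel carries a query embedding that attends over the image tokens.
img_feats = rng.standard_normal((N_PIX, D))      # 2D image features
voxel_queries = rng.standard_normal((N_VOX, D))  # one query per voxel

attn = softmax(voxel_queries @ img_feats.T / np.sqrt(D), axis=-1)
voxel_feats = attn @ img_feats                   # (N_VOX, D) volumetric repr.

# Step 2: a (here randomly initialized) learnable scorer ranks voxels by
# task relevance; the top-K become the compact set of spatial tokens.
w_score = rng.standard_normal(D)
scores = voxel_feats @ w_score
top_k = np.argsort(scores)[-K:]
spatial_tokens = voxel_feats[top_k]              # (K, D) compact token set

# Step 3: the action decoder would condition on all K tokens jointly,
# avoiding the collapse of the token set into a single descriptor.
print(voxel_feats.shape, spatial_tokens.shape)   # (512, 32) (16, 32)
```

The key design point this sketch mirrors is that selection reduces the token count (512 → 16 here) before decoding, while the decoder still sees a *set* of spatial tokens instead of one pooled vector.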