VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the spatial mismatch between 2D image observations and 3D action outputs in visual imitation learning by proposing an architecture that reasons in an explicit voxelized 3D representation. The method lifts 2D image features into a voxel grid via cross-attention, employs a learnable module to select task-relevant voxels and compress them into a compact set of spatial tokens, and then uses a multi-token decoder that conditions on the full token set to predict actions. This design avoids the geometric information loss inherent in conventional feature-aggregation schemes, substantially improving spatial reasoning and robustness. The approach achieves a state-of-the-art average success rate of 88.8% on the LIBERO benchmark, outperforming the strongest baseline by 14.8%, delivers large gains on ManiSkill and LIBERO-Plus, and demonstrates strong generalization to novel layouts, viewpoints, and backgrounds in real-world experiments.

📝 Abstract
Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a volumetric representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
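The three-stage pipeline in the abstract (cross-attention lifting, voxel selection into spatial tokens, multi-token action decoding) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all dimensions are toy values, the voxel "scorer" is a random projection standing in for the learnable selection module, and the decoder is a single linear map standing in for the multi-token decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the paper does not specify these here)
n_pix, d = 64, 16        # number of 2D image feature tokens, channel dim
grid = 4                 # voxel grid resolution -> 4**3 = 64 voxels
n_vox = grid ** 3
k = 8                    # task-relevant voxels kept as spatial tokens
n_act = 7                # action dim (e.g. 6-DoF end-effector + gripper)

img_feat = rng.normal(size=(n_pix, d))    # 2D image features
vox_query = rng.normal(size=(n_vox, d))   # one learnable query per voxel

# 1) Lift 2D features into the voxel grid via cross-attention:
#    each voxel query attends over all image feature tokens.
attn = softmax(vox_query @ img_feat.T / np.sqrt(d))
vox_feat = attn @ img_feat                # (n_vox, d) volumetric features

# 2) Select task-relevant voxels: a learned scorer (here a random
#    projection as a stand-in) ranks voxels; the top-k become a
#    compact set of spatial tokens.
w_score = rng.normal(size=(d,))
scores = vox_feat @ w_score
top_idx = np.argsort(scores)[-k:]
spatial_tokens = vox_feat[top_idx]        # (k, d)

# 3) Multi-token decoding: condition on the entire token set
#    (no pooling into a single descriptor) to predict an action.
w_dec = rng.normal(size=(k * d, n_act))
action = spatial_tokens.reshape(-1) @ w_dec   # (n_act,)
```

The point of step 3 is that the decoder sees all `k` spatial tokens jointly, which is what lets the design avoid the lossy aggregation the abstract criticizes.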
Problem

Research questions and friction points this paper is trying to address.

visual imitation learning
2D-3D mismatch
spatial reasoning
robotic manipulation
volumetric representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

volumetric representation
3D spatial reasoning
visual imitation learning
spatial tokens
cross-attention
🔎 Similar Papers
2024-07-16 · Neural Information Processing Systems · Citations: 16
Tianxing Zhou (IIIS, Tsinghua University)
Feiyang Xue (IIIS, Tsinghua University)
Zhangchen Ye (IIIS, Tsinghua University)
Tianyuan Yuan (Tsinghua University)
Hang Zhao (Assistant Professor, Tsinghua University)
Tao Jiang (Galaxea AI)