🤖 AI Summary
To address the spatial perception and real-time feedback requirements of robotic grasping in realistic settings, this work systematically compares 2D visual and 3D voxel-based representations for vacuum-based box manipulation within the SERL offline reinforcement learning framework. We propose a lightweight spatial encoder architecture tailored to SERL, enabling deep fusion of voxelized 3D scene inputs with closed-loop vision–force feedback. Experiments demonstrate that the 3D representation improves grasp success rate by 42% and reduces required training samples by 60%, significantly enhancing policy sample efficiency and cross-scenario generalization. To our knowledge, this is the first systematic 2D/3D spatial representation comparison conducted on a real-world vacuum gripper task. The code and real-robot demonstration videos are publicly released.
📝 Abstract
When manipulating objects in the real world, we need reactive feedback policies that incorporate sensor information into their decisions. This study examines how different encoders can be used within a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation compares real-world vision with 3D scene inputs, exploring new architectures in the process. We build on the SERL framework, which provides a sample-efficient and stable RL foundation while keeping training times minimal. Our results on a box-picking task with a vacuum gripper indicate that policies using spatial 3D inputs significantly outperform their visual counterparts. The code and videos of the evaluations are available at https://github.com/nisutte/voxel-serl.
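The voxelized 3D scene inputs mentioned above imply a preprocessing step that discretizes depth-camera point clouds into an occupancy grid before they reach the spatial encoder. The sketch below illustrates one minimal way such a step could look; the function name, grid resolution, and workspace bounds are our assumptions for illustration, not taken from the released code.

```python
import numpy as np

def voxelize(points, bounds_min, bounds_max, grid_size=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Points outside the axis-aligned workspace [bounds_min, bounds_max)
    are discarded. Returns a (grid_size, grid_size, grid_size) uint8 array.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = np.asarray(bounds_min, dtype=np.float64)
    hi = np.asarray(bounds_max, dtype=np.float64)
    # Map each point to an integer voxel index in [0, grid_size).
    idx = np.floor((points - lo) / (hi - lo) * grid_size).astype(int)
    # Keep only points that land inside the workspace bounds.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[inside]
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Example: two points inside a unit-cube workspace, one outside.
pts = [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [1.5, 0.0, 0.0]]
grid = voxelize(pts, [0, 0, 0], [1, 1, 1], grid_size=4)
```

A grid like this can then be fed to a lightweight 3D-convolutional encoder and fused with proprioceptive and force signals in the policy network.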