A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping

📅 2025-03-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the spatial perception and real-time feedback requirements of robotic grasping in realistic settings, this work systematically compares 2D visual and 3D voxel-based representations for vacuum-based box manipulation within the SERL offline reinforcement learning framework. We propose a lightweight spatial encoder architecture tailored to SERL, enabling deep fusion of voxelized 3D scene inputs with closed-loop vision–force feedback. Experiments demonstrate that the 3D representation improves grasp success rate by 42% and reduces required training samples by 60%, significantly enhancing policy sample efficiency and cross-scenario generalization. To our knowledge, this is the first systematic 2D/3D spatial representation comparison conducted on a real-world vacuum gripper task. The code and real-robot demonstration videos are publicly released.

Technology Category

Application Category

📝 Abstract
When manipulating objects in the real world, we need reactive feedback policies that take into account sensor information to inform decisions. This study aims to determine how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation focuses on comparing real-world vision with 3D scene inputs, exploring new architectures in the process. We built on the SERL framework, providing us with a sample efficient and stable RL foundation we could build upon, while keeping training times minimal. The results of this study indicate that spatial information helps to significantly outperform the visual counterpart, tested on a box picking task with a vacuum gripper. The code and videos of the evaluations are available at https://github.com/nisutte/voxel-serl.
Problem

Research questions and friction points this paper is trying to address.

Compare visual representations for real-world reinforcement learning.
Evaluate encoders in RL for robot arm spatial environment interpretation.
Assess 3D scene inputs versus real-world vision in vacuum gripping tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes SERL framework for efficient reinforcement learning
Compares 3D scene inputs with real-world vision
Focuses on spatial information for robotic arm tasks
🔎 Similar Papers
No similar papers found.
N
Nico Sutter
Computational Robotics Lab, ETH Zurich, CH
Valentin N. Hartmann
Valentin N. Hartmann
ETH Zürich
RoboticsPath PlanningTask and Motion PlanningMotion Planning
S
Stelian Coros
Computational Robotics Lab, ETH Zurich, CH