🤖 AI Summary
This work addresses the challenge of enabling a low-cost robot arm to locate a partially visible object and trigger a grasp using only low-resolution egocentric visual input. To this end, the authors propose a behavior cloning approach that predicts relative joint deltas directly from wrist-mounted RGB images, so the robot adjusts its viewpoint under closed-loop control to improve its future observations. The experiments indicate that effective active perception behaviors emerge from this approach without requiring sophisticated perception modules, and that predicting relative joint deltas substantially outperforms absolute joint position prediction in the authors' setting. Implemented on inexpensive hardware, the system completes the task reliably, suggesting that low-resolution egocentric vision is sufficient for this structured object-finding task.
📝 Abstract
We investigate whether behavior cloning is sufficient to produce active perception in a structured object-finding task. A low-cost robot arm equipped with a wrist-mounted egocentric RGB camera must reposition to center a partially visible plant before triggering a grasp signal, requiring actions that improve future observations. The model predicts joint commands directly from low-resolution RGB images under closed-loop control. We show that low-resolution egocentric vision is sufficient for reliable task completion and that predicting relative joint deltas substantially outperforms absolute joint position prediction in our setting. These results demonstrate that visually grounded active perception can emerge from behavior cloning in a reproducible setting.
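To make the distinction between the two action parameterizations concrete, the following is a minimal sketch of how a relative joint delta and an absolute joint target might each be applied inside a closed control loop. It is not the authors' implementation: the function names, the per-step limit `max_step`, the joint limits, and the toy one-joint `observe` and `policy` stand-ins (which replace the camera and the learned network) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Joints = List[float]


def clamp(values: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> Joints:
    """Clip each joint value to its [lower, upper] limit."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(values, lower, upper)]


def apply_delta(current: Sequence[float], delta: Sequence[float],
                lower: Sequence[float], upper: Sequence[float],
                max_step: float = 0.05) -> Joints:
    """Relative parameterization: next = current + bounded delta (assumed step limit)."""
    limited = [min(max(d, -max_step), max_step) for d in delta]
    return clamp([c + d for c, d in zip(current, limited)], lower, upper)


def apply_absolute(target: Sequence[float],
                   lower: Sequence[float], upper: Sequence[float]) -> Joints:
    """Absolute parameterization: the prediction is itself the next joint target."""
    return clamp(list(target), lower, upper)


def run_closed_loop(policy: Callable[[List[float]], Tuple[List[float], bool]],
                    observe: Callable[[Joints], List[float]],
                    joints: Joints, lower: Joints, upper: Joints,
                    steps: int) -> Joints:
    """Observe, predict a delta, move, and repeat until the policy signals a grasp."""
    for _ in range(steps):
        observation = observe(joints)          # new observation after every move
        delta, grasp = policy(observation)
        if grasp:                              # object centered: stop and grasp
            break
        joints = apply_delta(joints, delta, lower, upper)
    return joints


# Toy stand-ins (hypothetical): the "image" is just the object's offset from center.
TARGET = 0.3


def observe(joints: Joints) -> List[float]:
    return [TARGET - joints[0]]


def policy(observation: List[float]) -> Tuple[List[float], bool]:
    offset = observation[0]
    return [0.5 * offset], abs(offset) < 0.01


final = run_closed_loop(policy, observe, [0.0], [-1.0], [1.0], steps=50)
```

In this toy loop each action changes what the next observation shows, which is the property the task relies on. With the delta form, a prediction error shifts the arm by at most one bounded step and can be corrected at the next observation, whereas an absolute prediction must be correct in joint space on its own. This is one plausible intuition for the reported gap, not a result from the paper.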