StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

📅 2026-05-11

🤖 AI Summary
Monocular vision lacks reliable depth cues, which limits precise robotic manipulation in cluttered or geometrically complex scenes. This work proposes StereoPolicy, a framework that integrates stereo vision into diffusion-based and vision-language-action (VLA) policies. It processes each image of a synchronized stereo pair with a pretrained 2D encoder and fuses the resulting representations in a Stereo Transformer that implicitly models disparity and spatial correspondence, without explicit 3D reconstruction or camera calibration. On three simulation benchmarks (RoboMimic, RoboCasa, and OmniGibson), the method consistently improves over baselines that use RGB, RGB-D, point-cloud, or multi-view inputs. Real-robot experiments on tabletop and bimanual mobile manipulation tasks further validate the approach.
📝 Abstract
Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
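The abstract gives no implementation details for the Stereo Transformer, so the following is only a minimal sketch of one plausible fusion step, not the authors' architecture. It assumes each image has already been turned into a sequence of patch tokens by some 2D encoder (simulated here with random arrays), adds a hypothetical per-view embedding, and runs a single head of joint self-attention over the concatenated left and right tokens so that each token can attend to its counterpart in the other view. All names (`fuse_stereo_tokens`, `init_params`, `view_embed`) and all sizes are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(dim, rng):
    """Random stand-ins for learned weights (hypothetical, untrained)."""
    scale = 1.0 / np.sqrt(dim)
    return {
        "view_embed": rng.normal(0.0, 0.02, size=(2, dim)),  # left / right tags
        "w_q": rng.normal(0.0, scale, size=(dim, dim)),
        "w_k": rng.normal(0.0, scale, size=(dim, dim)),
        "w_v": rng.normal(0.0, scale, size=(dim, dim)),
        "w_o": rng.normal(0.0, scale, size=(dim, dim)),
    }


def fuse_stereo_tokens(left_tokens, right_tokens, params):
    """Single-head joint self-attention over left and right patch tokens.

    left_tokens, right_tokens: (N, D) arrays, assumed to come from a 2D encoder.
    Returns (fused, attn): fused tokens of shape (2N, D) and the (2N, 2N)
    attention matrix, whose off-diagonal blocks hold cross-view weights.
    """
    # Tag each token with its view so attention can tell the two images apart.
    left = left_tokens + params["view_embed"][0]
    right = right_tokens + params["view_embed"][1]
    tokens = np.concatenate([left, right], axis=0)  # (2N, D)

    q = tokens @ params["w_q"]
    k = tokens @ params["w_k"]
    v = tokens @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (2N, 2N)

    # Residual connection around the attention output.
    fused = tokens + (attn @ v) @ params["w_o"]
    return fused, attn


rng = np.random.default_rng(0)
num_patches, dim = 16, 8
params = init_params(dim, rng)
left = rng.normal(size=(num_patches, dim))   # stand-in for left-image tokens
right = rng.normal(size=(num_patches, dim))  # stand-in for right-image tokens
fused, attn = fuse_stereo_tokens(left, right, params)
print(fused.shape, attn.shape)
```

In this sketch, the block `attn[:num_patches, num_patches:]` contains the weights with which left-image tokens attend to right-image tokens. Cross-view attention of this kind is one way a model could pick up correspondence and disparity cues implicitly, with no calibration parameters entering the computation.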
Problem

Research questions and friction points this paper is trying to address.

monocular vision
depth perception
spatial awareness
robotic manipulation
stereo vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

StereoPolicy
stereo vision
visuomotor policy
geometric reasoning
Stereo Transformer