Zero-Shot Visual Generalization in Robot Manipulation

📅 2025-05-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the poor generalization of robotic manipulation policies under shifting visual conditions, which typically necessitates costly fine-tuning, this paper proposes a zero-shot visual adaptation framework. Methodologically, it combines (1) disentangled representation learning with associative memory, scaled from simple benchmarks to visually and dynamically complex manipulation tasks; (2) an extension of this approach to imitation learning, specifically Diffusion Policy; and (3) a technique adapted from the model-equivariance literature that transforms any trained neural network policy into one invariant to 2D planar rotations, adding resilience to certain camera perturbations. Evaluated in both simulation and on real hardware, the framework demonstrates zero-shot robustness to diverse visual disturbances, including illumination changes, occlusions, viewpoint shifts, and style variations, without task-specific adaptation, and it significantly outperforms state-of-the-art imitation learning methods in visual generalization.
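The associative-memory half of point (1) is easy to gloss over in prose. The paper does not spell out its architecture in this summary, so the sketch below is only a minimal, modern-Hopfield-style illustration of the idea: a query feature is replaced by a convex combination of stored prototypes, pulling features from visually perturbed observations back toward patterns seen in training. The class name, slot count, and `beta` temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociativeMemory(nn.Module):
    """One-step, modern-Hopfield-style retrieval over a learned bank of
    prototype features (an illustrative sketch, not the paper's module)."""

    def __init__(self, num_slots: int, dim: int, beta: float = 8.0):
        super().__init__()
        # Learned memory patterns, one prototype per row.
        self.memory = nn.Parameter(torch.randn(num_slots, dim))
        self.beta = beta  # inverse temperature; larger -> sharper retrieval

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) encoded observation features.
        attn = F.softmax(self.beta * z @ self.memory.t(), dim=-1)
        # Convex combination of prototypes: perturbed inputs are mapped
        # toward the nearest stored (training-time) patterns.
        return attn @ self.memory  # (batch, dim)
```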

📝 Abstract
Training vision-based manipulation policies that are robust across diverse visual environments remains an important and unresolved challenge in robot learning. Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth, or by brute-forcing generalization through visual domain randomization and/or large, visually diverse datasets. Disentangled representation learning - especially when combined with principles of associative memory - has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. However, these techniques have largely been constrained to simpler benchmarks and toy environments. In this work, we scale disentangled representation learning and associative memory to more visually and dynamically complex manipulation tasks and demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware. We further extend this approach to imitation learning, specifically Diffusion Policy, and empirically show significant gains in visual generalization compared to state-of-the-art imitation learning methods. Finally, we introduce a novel technique adapted from the model equivariance literature that transforms any trained neural network policy into one invariant to 2D planar rotations, making our policy not only visually robust but also resilient to certain camera perturbations. We believe that this work marks a significant step towards manipulation policies that are not only adaptable out of the box, but also robust to the complexities and dynamical nature of real-world deployment. Supplementary videos are available at https://sites.google.com/view/vis-gen-robotics/home.
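The abstract says only that the rotation-invariance technique is adapted from the model-equivariance literature and works on any trained policy, without giving the construction. One standard construction with exactly that property is group averaging over a discrete rotation group; the sketch below is a minimal version of that idea, assuming a PyTorch policy that maps a batch of images to a single action tensor. The function name, `num_rotations` default, and the use of torchvision's `rotate` are illustrative choices, not the paper's method.

```python
import torch
import torchvision.transforms.functional as TF

def make_rotation_invariant(policy, num_rotations: int = 4):
    """Wrap a trained image-conditioned policy so its output is unchanged
    under in-plane rotations of the camera image, via group averaging
    over the cyclic group C_n (a generic sketch, not the paper's method)."""
    angles = [360.0 * k / num_rotations for k in range(num_rotations)]

    def invariant_policy(image: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W). Averaging the policy's output over every
        # rotated copy of the input yields the same result for any input
        # already rotated by an element of C_n.
        outputs = [policy(TF.rotate(image, angle)) for angle in angles]
        return torch.stack(outputs).mean(dim=0)

    return invariant_policy
```

For multiples of 90° on square images the rotations are lossless, so invariance over C_4 is exact; finer rotation groups introduce interpolation and border artifacts, making the invariance only approximate.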
Problem

Research questions and friction points this paper is trying to address.

Training vision-based manipulation policies that remain robust across diverse visual environments
Scaling disentangled representation learning and associative memory beyond simple benchmarks to complex manipulation tasks
Achieving zero-shot visual generalization in imitation learning without costly fine-tuning or brute-force domain randomization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling disentangled representation learning with associative memory to visually and dynamically complex manipulation tasks
Extending the approach to imitation learning, specifically Diffusion Policy
Introducing a model-equivariance-based transformation that makes any trained neural network policy invariant to 2D planar rotations (see the sketch after this list)
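As a rough picture of how these pieces might compose at inference time, the sketch below reuses the `AssociativeMemory` and `make_rotation_invariant` sketches from earlier, with stand-in encoder and action-head networks. Everything here is illustrative: in practice the encoder would be the trained disentangled visual encoder and the head a trained Diffusion Policy denoiser, not a single linear layer.

```python
import torch
import torch.nn as nn

# Stand-in modules purely for illustration; in practice these would be
# the trained vision encoder and a Diffusion Policy action head.
encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
memory = AssociativeMemory(num_slots=512, dim=256)
head = nn.Linear(256, 7)  # e.g. a 7-DoF end-effector action

def policy(image: torch.Tensor) -> torch.Tensor:
    # Encode, snap features back to familiar prototypes, then act.
    return head(memory(encoder(image)))

robust_policy = make_rotation_invariant(policy, num_rotations=4)
action = robust_policy(torch.randn(1, 3, 96, 96))  # (1, 7) action
```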