🤖 AI Summary
In robot manipulation, the strong coupling between perception and control in end-to-end learning severely hinders sim-to-real transfer. To address this, we propose a decoupled learning framework: a general-purpose control policy is trained offline in simulation using privileged state information (e.g., object pose) and then frozen; in the real world, only a small number of demonstrations (10–20) are required to align the perception module online. This is the first approach to explicitly separate perception training from control training, eliminating the instability inherent in cross-domain joint optimization. Experiments on tabletop manipulation tasks demonstrate that our method significantly outperforms end-to-end baselines, generalizes to out-of-distribution object positions and scale variations, and achieves substantially improved data efficiency and transfer reliability.
📝 Abstract
Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10–20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
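The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's architecture: the linear "policy" and "perception" maps, all dimensions, and the synthetic demonstration data are assumptions made purely to show the structure — a control policy frozen after simulation training on privileged state, and a perception module fit afterward from a handful of (observation, state) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1 (simulation): control policy trained on privileged state. ---
# A fixed linear map from object state (e.g., pose) to action stands in for
# a policy learned with full state access in simulation, then frozen.
W_policy = rng.normal(size=(2, 3))  # illustrative assumption


def control_policy(state):
    """Frozen control policy: privileged state -> action."""
    return W_policy @ state


# --- Stage 2 (real world): align perception to the frozen policy. ---
# A few demonstrations supply (observation, privileged state) pairs; the
# perception module is fit by least squares to predict state from raw
# observations. The control policy is never updated in this stage.
n_demos = 15                          # within the 10-20 range in the abstract
A_true = rng.normal(size=(3, 8))      # hidden obs -> state relation (toy)
obs = rng.normal(size=(n_demos, 8))   # synthetic "real" observations
states = obs @ A_true.T               # privileged states recorded in demos

A_hat, *_ = np.linalg.lstsq(obs, states, rcond=None)  # perception alignment


def perceive(o):
    """Learned perception: real observation -> estimated privileged state."""
    return o @ A_hat


# Deployment: perception output feeds the frozen control policy unchanged.
test_obs = rng.normal(size=(8,))
action = control_policy(perceive(test_obs))
```

Because only the perception map is optimized, the sim-to-real problem reduces to a supervised regression onto the state representation the frozen policy already expects, which is what makes the small demonstration budget plausible.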