Do You Need Proprioceptive States in Visuomotor Policies?

📅 2025-09-23
🤖 AI Summary
This study investigates whether proprioceptive state inputs in visuomotor policies induce overfitting and impair spatial generalization. To address this, the authors propose a proprioception-free imitation learning framework that relies solely on visual observations from dual wide-angle wrist-mounted cameras and operates in a relative end-effector action space. The core innovation is the co-design of task-relevant visual coverage and relative action modeling, which markedly improves cross-spatial generalization: average success rates reach 85% on height-generalization tasks and 64% on horizontal-generalization tasks, substantially outperforming state-based baselines. The framework also demonstrates strong cross-embodiment adaptability and data efficiency across diverse real-world tasks, including pick-and-place, shirt folding, and whole-body manipulation, on multiple robot platforms.

📝 Abstract
Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. In contrast, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, State-free Policies also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.
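To make the distinction concrete, here is a minimal sketch of the interface difference between the two policy families: the state-based variant concatenates proprioceptive state with visual features, while the state-free variant conditions on the wrist-camera images alone. All module names and dimensions are illustrative assumptions, not the paper's actual architecture (the linear encoder is a stand-in for a real visual backbone).

```python
# Minimal sketch: state-based vs. state-free policy inputs.
# Hypothetical names/dimensions; the paper's real architecture differs.
import torch
import torch.nn as nn

class StateBasedPolicy(nn.Module):
    """Conventional policy: conditions on images AND proprioceptive state."""
    def __init__(self, vision_dim=512, state_dim=7, action_dim=7):
        super().__init__()
        # Stand-in for a real visual backbone over two wrist-camera views.
        self.encoder = nn.Linear(2 * 3 * 224 * 224, vision_dim)
        self.head = nn.Sequential(
            nn.Linear(vision_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, wrist_images, proprio_state):
        feat = self.encoder(wrist_images.flatten(1))
        return self.head(torch.cat([feat, proprio_state], dim=-1))

class StateFreePolicy(nn.Module):
    """State-free variant: actions predicted from visual observations only."""
    def __init__(self, vision_dim=512, action_dim=7):
        super().__init__()
        self.encoder = nn.Linear(2 * 3 * 224 * 224, vision_dim)
        self.head = nn.Sequential(
            nn.Linear(vision_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # relative end-effector action
        )

    def forward(self, wrist_images):
        return self.head(self.encoder(wrist_images.flatten(1)))

imgs = torch.randn(1, 2, 3, 224, 224)  # batch of dual wrist-camera views
action = StateFreePolicy()(imgs)       # no proprioceptive input required
```

Removing the state input forces the policy to ground its predictions in what it sees, which is the mechanism the paper credits for the improved spatial generalization.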
Problem

Research questions and friction points this paper is trying to address.

Proprioceptive states cause overfitting and poor spatial generalization in visuomotor policies
State-dependent policies fail to generalize across different spatial configurations and robot embodiments
Traditional approaches limit practical deployment due to reliance on proprioceptive state inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

State-free policy using only visual observations
Relative end-effector action space (see the sketch after this list)
Dual wide-angle wrist cameras for full visibility
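The relative action space is what decouples the learned behavior from absolute workspace coordinates: each training label is the motion to the next end-effector pose expressed in the current end-effector frame, so the same visual situation maps to the same action regardless of where in the workspace it occurs. Below is a rough sketch of that conversion under assumed conventions (positions plus xyzw quaternions, SciPy rotations); the paper does not publish this exact code.

```python
# Rough sketch: absolute end-effector poses -> relative action labels.
# Assumed conventions (xyzw quaternions, SciPy); not the paper's code.
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_ee_actions(poses):
    """poses: list of (position (3,), quaternion (4,) xyzw) along a demo.

    Returns one (delta_position, delta_rotation) pair per step, with the
    translation expressed in the current end-effector frame so the label
    is invariant to where the trajectory sits in the workspace.
    """
    actions = []
    for (p0, q0), (p1, q1) in zip(poses[:-1], poses[1:]):
        r0 = R.from_quat(q0)
        dp = r0.inv().apply(np.asarray(p1) - np.asarray(p0))  # frame-local translation
        dq = (r0.inv() * R.from_quat(q1)).as_quat()           # relative rotation
        actions.append((dp, dq))
    return actions
```

At deployment, the predicted delta is applied to wherever the end-effector currently is, so the policy never needs to read its absolute pose back from the robot.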
👥 Authors
Juntu Zhao (Shanghai Jiao Tong University, Spirit AI)
Wenbo Lu (Spirit AI, New York University Shanghai)
Di Zhang (Spirit AI, Tongji University)
Yufeng Liu (Shanghai Jiao Tong University, Spirit AI)
Yushen Liang (New York University Shanghai)
Tianluo Zhang (New York University Shanghai)
Yifeng Cao (Spirit AI)
Junyuan Xie (University of Washington) · Artificial Intelligence, Machine Learning, Computer Vision
Yingdong Hu (Institute for Interdisciplinary Information Sciences, Tsinghua University) · Computer Vision, Robotics
Shengjie Wang (Tsinghua University) · Robotics, Reinforcement Learning, Bionic Robotics
Junliang Guo (Microsoft Research) · Deep Learning, Generative Models, Natural Language Processing
Dequan Wang (Shanghai Jiao Tong University) · AI for Science
Yang Gao (Spirit AI, Tsinghua University)