🤖 AI Summary
This work addresses the limited generalization of vision-enabled reinforcement learning (RL) policies in contact-rich robotic manipulation, which tend to overfit to the visual conditions seen during training. The authors propose a human-in-the-loop teacher-student distillation framework that transfers knowledge from a vision-enabled teacher policy to a vision-free student policy relying solely on pose, twist (linear and angular velocity), and wrench (force/torque) sensing. The framework is trained entirely in the real world, without domain randomization or data augmentation, and combines fast training with strong task generalization. On the real-world NIST assembly benchmark board, the method achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, a figure that includes generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task, and the resulting policies outperform baselines in both robustness and adaptability.
📝 Abstract
When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.
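The general idea of distilling a vision-enabled teacher into a vision-free student can be illustrated with a toy example. The sketch below is not the authors' framework: it replaces the human-in-the-loop RL setup with plain supervised regression, reduces pose, twist, and wrench to one scalar each, and assumes the quantity the teacher reads from vision is also recoverable from pose. The functions `teacher_policy`, `student_policy`, and `distill`, and all coefficients, are hypothetical.

```python
import random


def teacher_policy(visual_offset, wrench):
    """Hypothetical teacher: acts on a visually estimated offset and the wrench."""
    # Move toward the target seen by the camera, back off under contact force.
    return -0.8 * visual_offset - 0.1 * wrench


def student_policy(weights, obs):
    """Linear vision-free student: obs is (pose, twist, wrench)."""
    return sum(w * x for w, x in zip(weights, obs))


def distill(num_samples=500, lr=0.05, epochs=20, seed=0):
    """Fit the student to the teacher's actions with per-sample gradient steps."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        pose = rng.uniform(-1.0, 1.0)
        twist = rng.uniform(-1.0, 1.0)
        wrench = rng.uniform(-1.0, 1.0)
        # Toy assumption: the camera observes the same offset that pose encodes.
        visual_offset = pose
        target = teacher_policy(visual_offset, wrench)
        data.append(((pose, twist, wrench), target))

    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        rng.shuffle(data)
        for obs, target in data:
            # Gradient of 0.5 * (student - teacher)^2 with respect to the weights.
            err = student_policy(weights, obs) - target
            weights = [w - lr * err * x for w, x in zip(weights, obs)]
    return weights


if __name__ == "__main__":
    w_pose, w_twist, w_wrench = distill()
    print(f"pose: {w_pose:.3f}, twist: {w_twist:.3f}, wrench: {w_wrench:.3f}")
```

In this toy setting the student reproduces the teacher's behavior without any image input, because the information the teacher takes from vision is also present in the student's observations. Whether and how that holds on a real robot is what the paper evaluates.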