Student-Informed Teacher Training

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
In privileged imitation learning, students often fail to replicate teacher behaviors due to limited observational capabilities—stemming from a fundamental asymmetry: the teacher’s policy is not designed for the student’s partially observable setting. To address this, we propose a joint teacher-student training framework. First, we incorporate an action-divergence approximation term into the teacher’s reward function, theoretically grounded in performance bounds to mitigate imitation failure. Second, we introduce a supervised behavioral alignment step that explicitly constrains the teacher’s policy to be imitable by the student. Third, we optimize the entire system via vision-driven, end-to-end reinforcement learning. Evaluated on maze navigation, vision-guided quadrotor flight, and dexterous manipulation tasks, our approach yields substantial improvements in student policy performance, empirically validating the efficacy of enhancing teacher imitability.
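The reward shaping in the first step can be sketched in a few lines. The function name, the squared-L2 action divergence, and the weight `lam` below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def penalized_teacher_reward(task_reward, teacher_action, student_action, lam=0.1):
    """Task reward minus an approximated teacher-student action divergence.

    The squared L2 distance between the two actions stands in for the
    action-difference term; `lam` trades off task performance against
    imitability (both are illustrative, not the paper's exact form).
    """
    diff = np.asarray(teacher_action) - np.asarray(student_action)
    divergence = float(np.sum(diff ** 2))
    return task_reward - lam * divergence
```

When teacher and student agree, the penalty vanishes and the teacher optimizes the task reward alone; as their actions diverge, the teacher is pushed toward behaviors the student can actually reproduce.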

📝 Abstract
Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering whether the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latter's limited access to information and partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses teacher-student asymmetry in imitation learning
Improves student imitation despite partial observability
Jointly trains teacher and student policies for alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint teacher-student policy training
Approximated action difference penalty
Supervised teacher-student alignment step
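How these pieces interact can be sketched with a toy joint-training loop on linear policies. Everything here is illustrative (the 2-D privileged state, the 1-D partial observation, the learning rate, and the use of plain SGD), and the teacher's task-reward term is omitted for brevity to isolate the alignment dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: the teacher sees a privileged 2-D state,
# the student observes only its first component.
W_teacher = rng.normal(size=2)   # teacher policy weights (full state)
W_student = rng.normal(size=1)   # student policy weights (partial observation)
lr = 0.05

for step in range(500):
    state = rng.normal(size=2)   # privileged state
    obs = state[:1]              # student's partial observation
    a_teacher = W_teacher @ state
    a_student = W_student @ obs
    gap = a_teacher - a_student  # the action difference being penalized

    # Student step: supervised regression of student action onto the teacher's.
    W_student += lr * 2 * gap * obs

    # Alignment step: nudge the teacher toward actions the student can
    # reproduce (gradient of the squared action gap w.r.t. teacher weights).
    W_teacher -= lr * 2 * gap * state
```

The teacher gradually stops relying on the state component the student cannot see, so the action gap shrinks: the behavior remains expressible under the student's partial observability, which is the core idea behind enhancing teacher imitability.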