Proximal Policy Distillation

📅 2024-07-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address low sample efficiency, passive student exploration, and weak generalization caused by imperfect teacher demonstrations in policy distillation, this paper proposes Proximal Policy Distillation (PPD). PPD is the first to integrate Proximal Policy Optimization (PPO) into the policy distillation framework, unifying student-driven distillation with PPO's self-feedback optimization mechanism. This enables active student exploration and lets the student exploit the additional rewards it collects during distillation. A KL-divergence constraint keeps the student's behavior consistent with the teacher's, and PPD supports student networks that are smaller than, equal in size to, or larger than the teacher. Evaluated on ATARI, MuJoCo, and Procgen benchmarks, PPD consistently outperforms both the student-distill and teacher-distill variants in sample efficiency and final policy performance. Notably, it is more robust and generalizes better when teacher demonstrations are suboptimal. The implementation is open-sourced as sb3-distill, built on stable-baselines3.
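The combined objective described above can be sketched in miniature: a PPO clipped surrogate plus a KL penalty that pulls the student's action distribution toward the teacher's. This is an illustrative toy, not the sb3-distill API; the function names, the scalar per-sample formulation, and the `kl_coef` weighting are all assumptions.

```python
import math

def ppo_clip_term(ratio, advantage, eps=0.2):
    # PPO clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

def kl_categorical(p, q):
    # KL(p || q) between two discrete action distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ppd_loss(ratios, advantages, teacher_probs, student_probs, kl_coef=1.0):
    # Hypothetical PPD-style loss: maximize the PPO surrogate (hence the
    # minus sign) while penalizing divergence from the teacher policy.
    ppo = sum(ppo_clip_term(r, a) for r, a in zip(ratios, advantages)) / len(ratios)
    kl = sum(kl_categorical(t, s)
             for t, s in zip(teacher_probs, student_probs)) / len(teacher_probs)
    return -ppo + kl_coef * kl

# When the student already matches the teacher, only the PPO term remains:
loss = ppd_loss([1.0], [1.0], [[0.5, 0.5]], [[0.5, 0.5]])
```

The clipping keeps each policy update close to the sampling policy (standard PPO), while the KL term keeps the student close to the teacher; `kl_coef` trades off imitation against reward maximization.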

📝 Abstract
We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, MuJoCo, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: `sb3-distill`.
Problem

Research questions and friction points this paper is trying to address.

Low sample efficiency in standard policy distillation
Student policies that passively imitate and discard rewards collected during distillation
Distillation that degrades when teacher demonstrations are imperfect
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates student-driven distillation with PPO
Improves sample efficiency and exploits rewards collected during distillation
Remains robust when distilling from imperfect demonstrations
Giacomo Spigler
AI for Robotics Lab (AIR-Lab), Department of Cognitive Science and Artificial Intelligence, Tilburg University