WPT: World-to-Policy Transfer via Online World Model Distillation

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing world models suffer from two key bottlenecks in policy learning: tight runtime coupling and heavy reliance on offline reward signals—hindering end-to-end optimization and real-time inference. This paper proposes a “World-to-Policy Transfer” framework guided by an online world model. First, we construct an end-to-end differentiable world model. Second, we introduce a trainable reward model that generates online teaching signals by dynamically aligning candidate trajectories with world-model predictions. Third, we design a dual distillation mechanism—world-knowledge distillation and reward-signal distillation—to efficiently inject the world model’s dynamic priors and reward logic into a lightweight policy network. Evaluated on open-loop and closed-loop autonomous driving benchmarks, our method achieves state-of-the-art performance: collision rate of 0.11, driving score of 79.23, and up to 4.9× faster inference—demonstrating superior safety and real-time capability.

📝 Abstract
Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference while retaining most of the gains.
Problem

Research questions and friction points this paper is trying to address.

Reducing inference overhead in world models
Enabling online distillation with end-to-end optimization
Transferring reasoning ability to lightweight policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online distillation via end-to-end world model guidance
Trainable reward model aligning trajectories with predictions
Policy distillation transferring reasoning to lightweight student
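The dual distillation above (world-knowledge distillation plus reward-signal distillation) amounts to training the lightweight student against a combined objective. The sketch below is a minimal, hypothetical illustration of such an objective, not the paper's actual implementation: all function names, feature shapes, and loss weights (`w_world`, `w_reward`) are assumptions for exposition.

```python
# Hypothetical sketch of a WPT-style dual distillation loss (illustrative only).
# World-knowledge term: pull student features toward the world model teacher's
# features. Reward-signal term: pull the student's candidate-trajectory scores
# toward the trainable reward model's scores.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def wpt_distillation_loss(student_feat, teacher_feat,
                          student_scores, reward_scores,
                          w_world=1.0, w_reward=0.5):
    """Combined loss: world-knowledge term + reward-signal term.

    The weights w_world / w_reward are illustrative hyperparameters,
    not values reported in the paper.
    """
    world_term = mse(student_feat, teacher_feat)      # feature alignment
    reward_term = mse(student_scores, reward_scores)  # score alignment
    return w_world * world_term + w_reward * reward_term

# Toy usage: two candidate trajectories, 4-dim features.
loss = wpt_distillation_loss(
    student_feat=[0.1, 0.2, 0.3, 0.4],
    teacher_feat=[0.1, 0.2, 0.3, 0.4],   # features already aligned
    student_scores=[0.6, 0.4],
    reward_scores=[0.8, 0.2],            # reward model prefers trajectory 0
)
print(round(loss, 3))  # → 0.02
```

In a real system both terms would be computed on network outputs and backpropagated only through the student, so the world model and reward model act purely as frozen (or jointly trained) teachers during this phase.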