🤖 AI Summary
Existing fine-tuning methods for continuous-control policies lack online customization capability, as they require access to the original training data and model parameters, which makes them impractical for post-deployment adaptation to novel performance criteria. To address this, we propose Residual-MPPI: an online planning algorithm that operates without any knowledge of the original reinforcement learning or imitation learning training process, requiring only a prior action distribution. Its core innovation lies in integrating residual action modeling into the Model Predictive Path Integral (MPPI) framework, enabling zero-shot or few-shot policy redirection via stochastic trajectory sampling and weighted optimization. We demonstrate its efficacy by customizing the GT Sophy 1.0 agent in *Gran Turismo Sport* for real-time behavioral shifts, including energy-efficient driving and aggressive overtaking, and achieve superior performance over state-of-the-art baselines on MuJoCo benchmarks. This work establishes the first online policy customization method that is independent of the original training information.
📝 Abstract
Policies developed through Reinforcement Learning (RL) and Imitation Learning (IL) have shown great potential in continuous control tasks, but real-world applications often require adapting trained policies to unforeseen requirements. While fine-tuning can address such needs, it typically requires additional data and access to the original training metrics and parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the need for an extensive training phase and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at execution time, which we call Residual-MPPI. It can customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings, given access to the prior action distribution alone. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent Gran Turismo Sophy (GT Sophy) 1.0 in the challenging Gran Turismo Sport (GTS) racing environment. Code for the MuJoCo experiments is included in the supplementary material and will be open-sourced upon acceptance. Demo videos and code are available on our website: https://sites.google.com/view/residual-mppi.
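To make the idea concrete, the planning loop described above (sample residual actions around the frozen prior policy, roll out a dynamics model, and form an exponentially weighted average over trajectory returns) can be sketched as follows. This is a simplified illustration under assumed interfaces (`prior_policy`, `dynamics`, and `addon_reward` are hypothetical callables standing in for the prior action distribution, a learned or known dynamics model, and the new performance metric), not the authors' implementation:

```python
import numpy as np

def residual_mppi_step(prior_policy, dynamics, addon_reward, state,
                       horizon=5, num_samples=64, sigma=0.3,
                       temperature=0.5, rng=None):
    """One planning step of a simplified Residual-MPPI-style sketch.

    Samples Gaussian residuals around the prior policy's actions, rolls
    out the dynamics model, scores each trajectory with only the new
    (add-on) reward, and returns the softmax-weighted average of the
    sampled first actions. No access to the original training objective
    or parameters is needed, only the prior's actions.
    """
    rng = np.random.default_rng() if rng is None else rng
    action_dim = np.atleast_1d(prior_policy(state)).shape[0]
    first_actions = np.zeros((num_samples, action_dim))
    returns = np.zeros(num_samples)
    for k in range(num_samples):
        s = state
        for t in range(horizon):
            a_prior = prior_policy(s)                         # frozen prior policy
            a = a_prior + sigma * rng.standard_normal(a_prior.shape)  # residual noise
            if t == 0:
                first_actions[k] = a
            s = dynamics(s, a)
            returns[k] += addon_reward(s, a)                  # only the new objective
    # MPPI-style exponential weighting over sampled trajectories
    w = np.exp((returns - returns.max()) / temperature)
    w /= w.sum()
    return w @ first_actions
```

For instance, with a zero-action prior, integrator dynamics `s' = s + a`, and an add-on reward that penalizes distance to a target state, repeated calls to `residual_mppi_step` steer the executed actions toward the target even though the prior alone would stay put.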