🤖 AI Summary
This work addresses the low policy-generation efficiency and objective mismatch prevalent in model-based reinforcement learning (MBRL) and imitation learning. The authors propose DiffTORI, a unified framework in which policies are represented via differentiable trajectory optimization. Methodologically, the cost function and dynamics model are learned jointly and end-to-end: policy-gradient losses are backpropagated through the trajectory optimization process, so the learned model is driven directly by task performance. This directly targets the "objective mismatch" problem of prior MBRL algorithms, where the model is trained for prediction accuracy rather than for the task. For imitation learning with high-dimensional image and point-cloud observations, DiffTORI is benchmarked against feed-forward policy classes as well as energy-based models (EBMs) and diffusion policies. Extensive evaluation across 15 MBRL benchmark tasks and 35 imitation learning tasks shows consistent improvements over state-of-the-art methods in both domains, supporting the framework's effectiveness and broad applicability.
📝 Abstract
This paper introduces DiffTORI, which utilizes Differentiable Trajectory Optimization as the policy representation to generate actions for deep Reinforcement and Imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage the recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTORI addresses the "objective mismatch" issue of prior model-based RL algorithms, as the dynamics model in DiffTORI is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTORI for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations and compare our method to feed-forward policy classes as well as Energy-Based Models (EBM) and Diffusion. Across 15 model-based RL tasks and 35 imitation learning tasks with high-dimensional image and point cloud inputs, DiffTORI outperforms prior state-of-the-art methods in both domains. Our code is available at https://github.com/wkwan7/DiffTORI.
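To make the core mechanism concrete, here is a minimal toy sketch (not the paper's implementation) of a policy represented as differentiable trajectory optimization. An inner loop runs unrolled gradient descent on an action sequence under a learned quadratic cost with known 1-D dynamics `x' = x + a`; an outer imitation loss on the planned first action is then backpropagated through the inner optimizer to update the cost weight `w`. The horizon, learning rates, and expert action are all illustrative assumptions.

```python
import torch

# Toy sketch of differentiable trajectory optimization as a policy.
# Inner loop: unrolled gradient descent on actions under a learned cost.
# Outer loop: imitation loss differentiated THROUGH the inner optimizer,
# so the cost parameter is trained to maximize task performance directly.
# All constants (horizon, lrs, expert action) are illustrative.

torch.manual_seed(0)
H = 3                                        # planning horizon
w = torch.tensor(1.0, requires_grad=True)    # learnable cost weight

def plan(x0, w, steps=100, lr=0.02):
    """Unrolled inner optimization; returns the planned action sequence."""
    a = torch.zeros(H, requires_grad=True)
    for _ in range(steps):
        x, cost = x0, 0.0
        for t in range(H):
            x = x + a[t]                     # known linear dynamics
            cost = cost + w * x**2 + 0.1 * a[t]**2
        # create_graph=True keeps the inner update differentiable,
        # so outer gradients can flow back into w
        (g,) = torch.autograd.grad(cost, a, create_graph=True)
        a = a - lr * g
    return a

expert_a0 = torch.tensor(-0.8)               # demonstrated action at x0 = 1.0
opt = torch.optim.Adam([w], lr=0.05)
losses = []
for _ in range(60):
    a = plan(torch.tensor(1.0), w)
    loss = (a[0] - expert_a0) ** 2           # outer (imitation) loss
    opt.zero_grad()
    loss.backward()                          # gradient flows through the planner
    opt.step()
    losses.append(loss.item())
```

The same pattern scales to learned neural cost and dynamics functions and to policy-gradient losses in RL; DiffTORI builds on more efficient differentiable trajectory optimizers than this naive unrolled loop, but the end-to-end gradient path is the key idea.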