🤖 AI Summary
Trajectory optimization solvers treat each problem instance independently, so their convergence speed and the quality of the solution they find depend heavily on the initial trajectory. Diffusion-based policies are natural candidates for producing such initial guesses, but when trained on locally optimal demonstrations they can accumulate errors over long horizons: small deviations during rollout push the policy into states not covered by the training data. To address this, the study targets gradient-based solvers that also return feedback gains and derives a first-order Sobolev loss for training diffusion policies. By supervising both the trajectories and the feedback gains, the method learns from only a small number of demonstration trajectories, avoids compounding errors over long horizons, and needs fewer diffusion steps at inference. The resulting initial guesses reduce subsequent solving time by a factor of 2 to 20.
📝 Abstract
Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently, so the convergence speed and the quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this property, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors and can therefore learn from very few trajectories to provide initial guesses that reduce solving time by $2\times$ to $20\times$. Incorporating first-order information also enables predictions with fewer diffusion steps, reducing inference latency.
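To make the idea of a first-order (Sobolev) loss concrete, the sketch below shows a generic version for a plain regression policy, not the diffusion-specific loss derived in the paper. It assumes the solver's local feedback law has the form u ≈ u* + K (x − x*), so that the gain K can serve as a target for the policy Jacobian ∂u/∂x; some solvers use the opposite sign convention. The names `policy`, `policy_jacobian`, `sobolev_loss`, and the weight `lam` are hypothetical and chosen for illustration only.

```python
import numpy as np

def policy(params, x):
    """Tiny one-hidden-layer tanh MLP mapping a state x to a control u.
    Hypothetical stand-in for a learned policy, not the paper's model."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ x + b1) + b2

def policy_jacobian(params, x):
    """Analytic Jacobian du/dx of the MLP above, shape (control_dim, state_dim)."""
    W1, b1, W2, _ = params
    h = np.tanh(W1 @ x + b1)
    # d tanh(z)/dz = 1 - tanh(z)^2, applied row-wise to W1
    return W2 @ ((1.0 - h**2)[:, None] * W1)

def sobolev_loss(params, states, controls, gains, lam=1.0):
    """Zeroth-order term (match controls) plus first-order term (match gains).
    Assumes each gain K is the target for du/dx at the corresponding state."""
    value_term = 0.0
    gain_term = 0.0
    for x, u, K in zip(states, controls, gains):
        value_term += np.sum((policy(params, x) - u) ** 2)
        gain_term += np.sum((policy_jacobian(params, x) - K) ** 2)
    return (value_term + lam * gain_term) / len(states)

# Usage with synthetic "demonstrations" (random data, for illustration only).
rng = np.random.default_rng(0)
state_dim, hidden_dim, control_dim = 4, 8, 2
params = (
    rng.normal(size=(hidden_dim, state_dim)),
    rng.normal(size=hidden_dim),
    rng.normal(size=(control_dim, hidden_dim)),
    rng.normal(size=control_dim),
)
states = [rng.normal(size=state_dim) for _ in range(5)]
controls = [rng.normal(size=control_dim) for _ in range(5)]
gains = [rng.normal(size=(control_dim, state_dim)) for _ in range(5)]
loss = sobolev_loss(params, states, controls, gains, lam=0.5)
```

The first-order term tells the policy how the control should change around each demonstrated state, which is extra supervision per sample compared with matching the controls alone. In practice the Jacobian would usually be obtained by automatic differentiation rather than by hand.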