🤖 AI Summary
Real-time Model Predictive Control (MPC) in robotics often requires solving batches of tens to low hundreds of nonlinear trajectory optimization problems under strict latency constraints. Existing GPU-accelerated approaches either parallelize a single solve, scale to very large batches at slower-than-real-time rates, or gain speed by restricting model generality, so none delivers real-time throughput at these moderate batch sizes. Method: We present GATO, an open-source batched solver built on a holistic algorithm–software–hardware co-design: (i) a hierarchical parallelization scheme combining block-, warp-, and thread-level parallelism within and across solves; (ii) a dynamic memory optimization strategy to maximize GPU resource utilization; and (iii) a general-purpose nonlinear optimization kernel implemented in CUDA. Contribution/Results: Simulated benchmarks show speedups of 18–21× over CPU baselines and 1.4–16× over GPU baselines as batch size increases. Case studies show improved disturbance rejection and convergence behavior, and the solver is validated on hardware using an industrial manipulator.
📝 Abstract
While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches typically (i) parallelize a single solve to meet real-time deadlines, (ii) scale to very large batches at slower-than-real-time rates, or (iii) achieve speed by restricting model generality (e.g., point-mass dynamics or a single linearization). This leaves a large gap in solver performance for many state-of-the-art MPC applications that require real-time batches of tens to low hundreds of solves. To close this gap, we present GATO, an open-source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput in this moderate-batch-size regime. Our approach combines block-, warp-, and thread-level parallelism within and across solves for high performance. We demonstrate its effectiveness through simulated benchmarks showing speedups of 18–21× over CPU baselines and 1.4–16× over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and a hardware validation on an industrial manipulator. We open source GATO to support reproducibility and adoption.