🤖 AI Summary
In 3D human motion prediction, density estimation inference is often prohibitively slow, frequently exceeding the target prediction horizon itself. To address this, we propose a two-stage cached normalizing flow framework: first, precomputing and caching the outputs of an unconditional normalizing flow model; second, employing a lightweight trajectory-encoding network that rapidly maps historical motion sequences into a Gaussian mixture latent space, enabling millisecond-scale conditional sampling. This work introduces a "cache-acceleration" paradigm that decouples density modeling from conditional inference without sacrificing accuracy or representational capacity. Our method achieves 4–30× faster inference than state-of-the-art approaches (≈1 ms), while matching their motion prediction accuracy on Human3.6M and AMASS and surpassing them in density estimation fidelity. Code and models are publicly available.
📝 Abstract
Many density estimation techniques for 3D human motion prediction require substantial inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation in 3D human motion prediction, we introduce a novel flow-based method called CacheFlow. Unlike previous conditional generative models, which suffer from slow inference, CacheFlow builds on an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The outputs of this flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be performed by a much more lightweight model, saving significant computational overhead compared to a typical conditional flow model. Through this two-stage design, caching the results of the slow flow computation, we build CacheFlow without loss of prediction accuracy or model expressiveness. Inference completes in approximately one millisecond, 4 times faster than previous VAE-based methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as the Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and prediction accuracy comparable to a state-of-the-art method on Human3.6M. Our code and models will be publicly available.
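The cache-then-condition idea described above can be illustrated with a minimal toy sketch. This is not the paper's implementation: `slow_flow` stands in for the trained unconditional normalizing flow, `encode_history` for the lightweight trajectory encoder, and the Gaussian reweighting over cached latents is an illustrative assumption about how conditioning over the cached Gaussian mixture samples might look. The point is only the structure: the expensive model runs once offline, and online conditional sampling touches only the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4          # toy latent/motion dimensionality
N_CACHE = 512  # number of precomputed flow samples

# --- Stage 1 (offline): run and cache the expensive unconditional flow ---
# Stand-in for a trained normalizing flow: a fixed invertible linear map.
A = rng.normal(size=(D, D)) + 3.0 * np.eye(D)

def slow_flow(z):
    """Expensive unconditional flow, latent z -> future motion (toy stand-in)."""
    return z @ A.T

# Latents drawn from the unconditional (here: single-component) Gaussian prior.
z_cache = rng.normal(size=(N_CACHE, D))
x_cache = slow_flow(z_cache)  # computed once, reused for every query

# --- Stage 2 (online): lightweight conditional inference over the cache ---
def encode_history(history):
    """Hypothetical lightweight encoder: history -> latent mean (toy: mean pool)."""
    return history.mean(axis=0)

def conditional_sample(history, n_samples=8, sigma=0.5):
    """Reweight cached latents by a Gaussian centered at the encoded history,
    then return the matching cached flow outputs -- no flow pass at test time."""
    mu = encode_history(history)
    logw = -0.5 * np.sum((z_cache - mu) ** 2, axis=1) / sigma**2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(N_CACHE, size=n_samples, p=w)
    return x_cache[idx]

history = rng.normal(loc=0.3, size=(10, D))  # 10 past frames
samples = conditional_sample(history)
print(samples.shape)  # (8, 4)
```

Because every returned sample is a row of `x_cache`, the online cost is one lightweight encoder pass plus a weighted lookup, which is what makes millisecond-scale inference plausible regardless of how slow the flow itself is.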