AI Summary
This work addresses three challenges in text-driven multi-person motion generation: long-horizon temporal modeling, variable-length text descriptions, and a varying number of participants. It proposes HINT, presented as the first autoregressive diffusion framework for this task. The method uses hierarchical interaction modeling to decouple local motion semantics from inter-person dynamics within a canonicalized latent space, and a sliding-window mechanism for efficient online generation. Its main contributions are an autoregressive diffusion architecture that accommodates varying numbers of agents and long text inputs, a disentangled motion representation, and a strategy for aggregating within-window and cross-window conditions. On the InterHuman benchmark, the approach achieves an FID of 3.100 (lower is better), compared with the previous state-of-the-art score of 5.154. Across public benchmarks it matches strong offline models and surpasses existing autoregressive methods.
Abstract
Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods generate fixed-length motions for a fixed number of agents, and are therefore inherently limited in handling long or variable-length text and varying agent counts. These limitations naturally motivate autoregressive formulations, which predict future motions step by step, conditioned on all past trajectories and the current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture each person's motion history, model inter-person dependencies, and align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.
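The sliding-window autoregressive loop described in the abstract can be illustrated with a minimal sketch. This is not the HINT implementation: `placeholder_sampler` is a hypothetical stand-in for a learned conditional diffusion sampler, and the pose dimension, window length, and truncated context length are assumptions chosen only for illustration. The sketch shows just the control flow, in which each new window is generated from past motion and the current text segment, for any number of people.

```python
import random
from typing import List

Pose = List[float]  # hypothetical per-frame pose vector for one person


def placeholder_sampler(history, text, num_people, window_len, dim, rng):
    """Stand-in for a conditional diffusion sampler (NOT the HINT model).

    It ignores `text` and simply continues each person's last pose with
    small random drift, so that the surrounding loop is runnable.
    """
    window = []
    for p in range(num_people):
        # Start from this person's last known pose, or zeros if none exists.
        last = history[p][-1] if history[p] else [0.0] * dim
        frames = []
        for _ in range(window_len):
            last = [x + rng.gauss(0.0, 0.01) for x in last]
            frames.append(last)
        window.append(frames)
    return window


def generate(text_segments, num_people, window_len=4, context_len=8,
             dim=3, seed=0, sampler=placeholder_sampler):
    """Generate motion window by window, one window per text segment."""
    rng = random.Random(seed)
    motion: List[List[Pose]] = [[] for _ in range(num_people)]
    for text in text_segments:
        # Assumption: condition on a truncated recent context per person.
        history = [m[-context_len:] for m in motion]
        window = sampler(history, text, num_people, window_len, dim, rng)
        for p in range(num_people):
            motion[p].extend(window[p])
    return motion


if __name__ == "__main__":
    out = generate(["two people shake hands", "a third person waves"],
                   num_people=3)
    print(len(out), "people,", len(out[0]), "frames each")
```

Because the loop is written over `num_people` and over a list of text segments, the same code handles different participant counts and arbitrarily long text without changing the output length in advance, which is the property the autoregressive formulation is meant to provide.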