Diffusion Forcing for Multi-Agent Interaction Sequence Modeling

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modeling multi-agent interactive sequences faces challenges including long-term temporal dependencies, strong inter-agent coupling, and variable group sizes, which cause existing methods to generalize poorly. This paper proposes MAGNet, a unified autoregressive diffusion framework that enables coherent motion generation for dynamic group sizes (from two to many agents), long sequences (hundreds of frames), and diverse conditional settings. Its core innovation builds on Diffusion Forcing, adding mechanisms that explicitly model spatiotemporal coupling among agents, integrated with an enhanced Transformer architecture incorporating dynamic graph-structured attention and conditional encoding. Experiments demonstrate that MAGNet matches state-of-the-art specialized two-agent models on standard benchmarks while uniquely achieving high spatiotemporal coherence and motion synchronization in multi-agent scenarios, including dance, boxing, and free-form social interaction, thereby overcoming fundamental limitations of fixed-role and fixed-size modeling paradigms.

📝 Abstract
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of frames. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/
Problem

Research questions and friction points this paper is trying to address.

Modeling multi-person interactions with long temporal horizons
Generalizing motion generation across variable group sizes
Capturing inter-agent dependencies for coherent coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive diffusion framework for multi-agent motion generation
Explicitly models inter-agent coupling during autoregressive denoising
Scalable architecture agnostic to number of agents
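To make the Diffusion Forcing idea behind these contributions concrete, below is a minimal, hedged sketch of the sampling pattern: each frame carries its own noise level, future frames stay noisier than past ones (enabling autoregressive rollout), and a shared denoiser sees all agents jointly so inter-agent coupling can be modeled during denoising. The function names (`diffusion_forcing_sample`, `toy_denoiser`), the linear schedule, and the cross-agent averaging are illustrative assumptions, not MAGNet's actual architecture or training setup.

```python
import numpy as np

def diffusion_forcing_sample(denoise_fn, num_agents, num_frames, dim, steps=4, rng=None):
    """Sketch of Diffusion Forcing-style sampling (illustrative, not the paper's code).

    Each frame t has an independent noise level; a causal offset means frame t
    only starts denoising after frame t-1 has begun, so earlier frames are
    always cleaner than later ones during generation.
    """
    rng = np.random.default_rng(rng)
    # x: (agents, frames, dim), initialized as pure Gaussian noise
    x = rng.standard_normal((num_agents, num_frames, dim))
    levels = np.ones(num_frames)  # 1.0 = pure noise, 0.0 = clean
    total_iters = steps + num_frames - 1
    for it in range(total_iters):
        # Frames whose turn it is step their noise level down by 1/steps.
        for t in range(num_frames):
            local_step = it - t  # causal offset: frame t lags frame 0 by t iterations
            if 0 <= local_step < steps:
                levels[t] = 1.0 - (local_step + 1) / steps
        # The denoiser sees all agents at once, so it can couple them.
        x = denoise_fn(x, levels.copy())
    return x, levels

def toy_denoiser(x, levels):
    """Hypothetical stand-in for a learned multi-agent transformer: shrinks
    each frame toward the cross-agent mean in proportion to how much it has
    been denoised, a crude proxy for learned inter-agent attention."""
    mean = x.mean(axis=0, keepdims=True)   # couple agents via their shared mean
    w = (1.0 - levels)[None, :, None]      # (1, frames, 1): 0 while noisy, 1 when clean
    return (1 - 0.5 * w) * x + 0.5 * w * mean

out, levels = diffusion_forcing_sample(toy_denoiser, num_agents=3, num_frames=5, dim=2, rng=0)
```

After sampling, every frame's noise level reaches zero; in an actual system the toy shrinkage would be replaced by a trained denoising network, and the per-frame schedule is what lets generation continue autoregressively past the initial window.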
Vongani H. Maluleke (UC Berkeley)
Kie Horiuchi (Sony Group Corporation)
Lea Wilken (UC Berkeley)
Evonne Ng (Meta)
Jitendra Malik (UC Berkeley)
Angjoo Kanazawa (UC Berkeley)
Computer Vision · Computer Graphics · Machine Learning