InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

📅 2025-12-08
🤖 AI Summary
Existing text-driven humanoid control methods are predominantly limited to single-agent settings, failing to model physically plausible, fine-grained interactions and coordination among multiple agents. To address this, the authors propose the first end-to-end, text-driven, physics-based multi-agent humanoid control framework. The approach explicitly encodes joint-level spatial dependencies and inter-agent dynamic correlations via an interaction graph; strengthens exteroceptive modeling through a sparse edge attention mechanism; and decouples proprioceptive, environmental, and motor processing with an autoregressive diffusion transformer built from multi-stream blocks. Across multiple benchmarks, the method significantly outperforms strong baselines, generating semantically coherent, temporally coordinated, and physically realistic multi-agent collaborative behaviors from natural language prompts alone. This work establishes a unified modeling paradigm for language-to-physics multi-agent coordinated control.

📝 Abstract
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven, physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction-graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Within it, we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It produces coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Our code and data will be released to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Driving physics-based multi-agent humanoid control directly from text.
Capturing fine-grained joint-to-joint spatial dependencies across interacting agents.
Modeling physically plausible multi-agent behaviors robustly.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive diffusion transformer with multi-stream blocks
Interaction graph exteroception capturing joint dependencies
Sparse edge-based attention pruning redundant connections
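The sparse edge-based attention idea can be illustrated with a toy sketch. This is not the paper's implementation: the function name, the use of Euclidean joint distances as edge features, and the top-k pruning rule are all assumptions made for illustration. The sketch keeps, for each joint of one agent, only its k closest joints on the other agent and softmax-weights those retained edges, discarding the rest.

```python
import numpy as np

def sparse_edge_attention(joints_a, joints_b, k=4):
    """Toy sketch of sparse inter-agent edge attention.

    joints_a, joints_b: (J, 3) arrays of 3D joint positions for two agents.
    Returns a (J, J) attention matrix where each row keeps only the k
    nearest inter-agent edges, softmax-weighted by negative distance.
    """
    # Pairwise distances between every joint of agent A and agent B: (J, J)
    diff = joints_a[:, None, :] - joints_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Prune: for each joint of A, retain only its k closest joints of B
    keep = np.argsort(dist, axis=1)[:, :k]

    weights = np.zeros_like(dist)
    for i, cols in enumerate(keep):
        # Softmax over negative distances so closer joints get more weight
        logits = -dist[i, cols]
        e = np.exp(logits - logits.max())
        weights[i, cols] = e / e.sum()
    return weights  # each row sums to 1 over its k retained edges
```

In the paper this pruning is learned and dynamic rather than a fixed distance heuristic, but the sketch shows the basic effect: redundant long-range joint pairs receive exactly zero weight, while the few spatially critical inter-agent relations dominate.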
Authors

Bin Li, ShanghaiTech University
Ruichi Zhang, University of Pennsylvania
Han Liang, ByteDance
Jingyan Zhang, ShanghaiTech University
Juze Zhang, Stanford University
Xin Chen, ByteDance
Lan Xu, ShanghaiTech University
Jingyi Yu, Professor, ShanghaiTech University (Computer Vision, Computer Graphics)
Jingya Wang, Assistant Professor, ShanghaiTech University (Computer Vision, Embodied AI, Human-Object Interaction)