AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

2026-03-16
Citations: 0
Influential: 0
AI Summary
This work addresses identity leakage and motion distortion in multi-character animation, which arise from entanglement between identity representations and pose dynamics. The paper proposes a Diffusion Transformer (DiT)-based framework that can animate an arbitrary number of characters. It introduces an Instance-Isolated Latent Representation (IILR) and a Tri-Stage Decoupled Attention (TSDA) mechanism, complemented by an Adaptive Gated Fusion (AGF) module, to bind each identity to its driving pose precisely and with spatio-temporal consistency. The design targets identity-pose mis-binding and token ambiguity in overlapping regions of multi-character scenes, with the aim of scaling to more characters while keeping identities consistent and motion controllable.

Abstract
Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
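
The three attention stages and the gated fusion named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not give the masks, projections, or gate network, so everything here is an assumption made for illustration. In particular, the function names (`stage_masks`, `masked_attention`, `three_stage_attention`, `gated_fusion`), the use of integer instance ids with 0 meaning background, the identity Q/K/V projections, the residual updates, and the gate logits being passed in by hand are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(x, mask):
    # Single-head self-attention over tokens x of shape (N, D).
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    # mask[i, j] is True where token i is allowed to attend to token j.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ x

def stage_masks(instance_ids):
    # instance_ids: one integer per token, 0 = background, k > 0 = character k
    # (an assumed labeling, not taken from the paper).
    ids = np.asarray(instance_ids)
    n = len(ids)
    same = ids[:, None] == ids[None, :]
    fg = ids > 0
    bg = ~fg
    eye = np.eye(n, dtype=bool)  # every token may always attend to itself
    fg_mask = (same & fg[:, None]) | eye         # (i) within one character only
    bg_mask = (bg[:, None] & bg[None, :]) | eye  # (ii) background with background
    global_mask = np.ones((n, n), dtype=bool)    # (iii) all tokens together
    return fg_mask, bg_mask, global_mask

def three_stage_attention(x, instance_ids):
    # Apply the three masked stages in sequence with residual updates.
    for mask in stage_masks(instance_ids):
        x = x + masked_attention(x, mask)
    return x

def gated_fusion(candidates, gate_logits):
    # Fuse K competing token candidates (K, D) into one token (D,).
    # In a real model the logits would be predicted; here they are inputs.
    return softmax(np.asarray(gate_logits, dtype=float)) @ candidates
```

Under these assumptions, the foreground-stage mask makes the isolation property easy to check: perturbing the tokens of one character leaves the stage (i) output of every other character unchanged, because no cross-instance attention weight is nonzero.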
Problem

Research questions and friction points this paper is trying to address.

multi-character animation
identity entanglement
identity-pose binding
controllability
spatio-temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instance-Isolated Latent Representation
Tri-Stage Decoupled Attention
Adaptive Gated Fusion
Multi-Character Animation
Diffusion Transformer
Zhenyu Xie
MBZUAI, Sun Yat-sen University
2D 3D Generation, Digital Human
Ji Xia
Mohamed bin Zayed University of Artificial Intelligence, UAE
Michael Kampffmeyer
University of Tromsø (UiT) – The Arctic University of Norway, Norway
Panwen Hu
Mohamed bin Zayed University of Artificial Intelligence, UAE
Zehua Ma
University of Science and Technology of China
Image Watermarking, Image Processing, 3D Printing, Aesthetic 2D Barcode
Yujian Zheng
Mohamed bin Zayed University of Artificial Intelligence
Computer Graphics, Computer Vision
Jing Wang
Shenzhen campus of Sun Yat-sen University, China
Zheng Chong
Sun Yat-sen University, Ph.D.
Image & Video Generation, Virtual Try-On
Xujie Zhang
Master of Sun Yat-sen University
multi-modal
Xianhang Cheng
Mohamed bin Zayed University of Artificial Intelligence, UAE
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer vision, Embodied AI, Machine learning
Hao Li
Mohamed bin Zayed University of Artificial Intelligence, UAE