AI Summary
This work addresses identity leakage and motion distortion in multi-character animation, which arise from entanglement between identity representations and pose dynamics. The paper proposes AnyCrowd, a Diffusion Transformer (DiT)-based framework that scales to an arbitrary number of characters. It introduces an Instance-Isolated Latent Representation (IILR) and a Tri-Stage Decoupled Attention (TSDA) mechanism, complemented by an Adaptive Gated Fusion (AGF) module, to achieve precise and spatio-temporally consistent binding between identities and driving poses. This design is intended to mitigate identity-pose mis-binding and token ambiguity in overlapping regions of multi-character scenes, supporting scalable generation of animations with consistent identities and controllable motion.
Abstract
Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
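As an illustration only, the three attention stages and the gated fusion described above can be sketched with plain Python lists. This is not the authors' implementation: the abstract does not specify the masking rules, so the sketch assumes one plausible reading in which stage (i) restricts each foreground token to its own instance, stage (ii) restricts background tokens to other background tokens, and stage (iii) lets all tokens attend to each other. The function names `stage_masks` and `gated_fusion`, the integer label convention, and the softmax gate are all hypothetical; in the paper the gate weights are predicted by a learned module, whereas here they are passed in as logits.

```python
import math


def stage_masks(labels):
    """Build three boolean attention masks from per-token instance labels.

    Assumed convention (not from the paper): labels[i] == 0 marks a
    background token, labels[i] == k (k >= 1) marks a token of character k.
    mask[q][k] is True when query token q may attend to key token k.
    """
    n = len(labels)
    # Stage (i): a foreground token attends only to tokens of its own instance.
    foreground = [
        [labels[q] != 0 and labels[q] == labels[k] for k in range(n)]
        for q in range(n)
    ]
    # Stage (ii): a background token attends only to background tokens.
    background = [
        [labels[q] == 0 and labels[k] == 0 for k in range(n)]
        for q in range(n)
    ]
    # Stage (iii): every token attends to every token.
    global_mask = [[True] * n for _ in range(n)]
    return foreground, background, global_mask


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(candidates, gate_logits):
    """Fuse competing token vectors at one position into a single vector.

    candidates: one feature vector per competing identity.
    gate_logits: one score per candidate (assumed to come from a learned
    gate; supplied directly here for illustration).
    """
    weights = softmax(gate_logits)
    dim = len(candidates[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, candidates))
        for d in range(dim)
    ]


# Five tokens: background, two tokens of character 1, one of character 2, background.
labels = [0, 1, 1, 2, 0]
fg_mask, bg_mask, all_mask = stage_masks(labels)

# Two characters overlap at one position; equal logits give an even blend.
fused = gated_fusion([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a real model these masks would be applied to the attention logits before the softmax, for example by setting disallowed positions to negative infinity. Under the stated assumptions, the per-instance mask in stage (i) is what keeps one character's tokens from attending to another's.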