🤖 AI Summary
Existing character animation methods (e.g., Animate Anyone) generate consistent and generalizable motions but neglect physical and semantic interactions between characters and their environments, leading to implausible actions. This work is the first to explicitly model environmental affordances as conditioning signals in diffusion-based generation, ensuring spatial and behavioral consistency between character poses and scene geometry/semantics. We introduce three core designs: (1) a shape-agnostic masking strategy that delineates environment regions independently of character shape; (2) an object-guided spatial feature fusion mechanism; and (3) a pose-driven feature modulation module. Together, these components handle environment region encoding, object-aware feature extraction, and spatially adaptive feature injection. Experiments demonstrate significant improvements in motion plausibility and visual fidelity across diverse interactive scenarios, surpassing state-of-the-art methods in both qualitative and quantitative evaluations.
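The summary does not spell out how the shape-agnostic mask is constructed, but the underlying idea (conditioning on an environment region whose boundary does not trace the character's silhouette) can be sketched roughly as below. The function name, dilation kernel range, and rectangle-based perturbation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def shape_agnostic_mask(char_mask: torch.Tensor,
                        max_kernel: int = 25,
                        num_patches: int = 4) -> torch.Tensor:
    """Turn an exact character mask (B, 1, H, W) into a coarse occlusion mask
    whose outline no longer reveals the character's shape (hypothetical sketch)."""
    # Dilate with a random odd-sized kernel so the mask boundary is loosened.
    k = int(torch.randint(3, max_kernel, (1,)).item()) | 1
    dilated = F.max_pool2d(char_mask, kernel_size=k, stride=1, padding=k // 2)

    # Overlay a few random rectangles to further break silhouette cues.
    b, _, h, w = dilated.shape
    out = dilated.clone()
    ph, pw = h // 8, w // 8
    for i in range(b):
        for _ in range(num_patches):
            y = int(torch.randint(0, h - ph, (1,)).item())
            x = int(torch.randint(0, w - pw, (1,)).item())
            out[i, :, y:y + ph, x:x + pw] = 1.0
    return out
```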
📝 Abstract
Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from the source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region excluding the characters, and our model generates characters that populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
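As a rough illustration of the "spatial blending" injection mentioned above, one plausible formulation (an assumption for clarity, not the authors' actual code) is a learned per-pixel gate that decides where object-guider features should override the denoising features:

```python
import torch
import torch.nn as nn

class SpatialBlend(nn.Module):
    """Blend object-guider features into denoising features via a learned
    per-pixel weight map (illustrative sketch, not the paper's code)."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate = nn.Conv2d(channels * 2, 1, kernel_size=1)

    def forward(self, denoise_feat: torch.Tensor,
                object_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W) at matching resolution.
        obj = self.proj(object_feat)
        # alpha in [0, 1] decides, per pixel, how much object feature to inject.
        alpha = torch.sigmoid(self.gate(torch.cat([denoise_feat, obj], dim=1)))
        return alpha * obj + (1.0 - alpha) * denoise_feat
```

In such a sketch, the block would sit at one or more resolutions of the denoising network, and the gating keeps regions without object interaction governed by the original features.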