🤖 AI Summary
To address structural distortions and temporal incoherence in cinematic-grade character animation, particularly under complex motion and cross-identity transfer, this paper proposes a framework that integrates a 3D-consistent pose representation with a context-aware diffusion Transformer. Methodologically, it introduces: (1) a geometrically robust 3D pose encoding that jointly represents joints and bones to preserve motion structure fidelity; (2) a full-sequence in-context pose injection mechanism to enhance long-range temporal modeling; and (3) a dedicated data pipeline and evaluation benchmark tailored for high-fidelity animation generation. Experimental results demonstrate state-of-the-art performance across multiple quantitative metrics, yielding substantial improvements in visual realism, motion stability, and cross-identity generalization. The framework provides a scalable, production-ready technical pathway for AI-driven cinematic animation synthesis.
📝 Abstract
Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.