🤖 AI Summary
This work addresses a core difficulty in portrait animation: identity, expression, and pose are entangled in RGB space, which leads to non-independent control, parameter-heavy models, and reliance on large training datasets. The authors propose Loki, which conditions on an identity-orthogonal parametric face model instead. Expression and head pose from the driving video are encoded as disentangled parametric coefficients and rasterized into a spatial map that the diffusion model takes as input. Identity is supplied separately, from the diffusion backbone's own pretrained features, through a lightweight key-value injection mechanism. Because the conditioning path for expression and pose avoids RGB, cross-identity reenactment reduces to coefficient substitution at inference and requires no cross-identity training data. The authors report ~43% fewer inference parameters than leading diffusion baselines and training on 1,496x fewer video samples, and Loki leads or co-leads on two newly defined metrics that measure head pose trajectory fidelity and expression following.
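The summary does not say how the key-value injection is implemented. The following is a minimal sketch of one common reading, assuming that reference-identity tokens are projected to extra keys and values and concatenated onto an attention layer's own keys and values. The function name `attention_with_kv_injection` and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_with_kv_injection(q, k, v, k_ref, v_ref):
    """Single-head attention with extra reference keys/values (assumed design).

    q:            (n, d) queries from the frame being generated
    k, v:         (m, d) the layer's own keys and values
    k_ref, v_ref: (r, d) keys and values derived from reference-identity features
    Returns the attended output (n, d) and the attention weights (n, m + r).
    """
    # Assumption: injection is a plain concatenation along the token axis.
    k_all = np.concatenate([k, k_ref], axis=0)
    v_all = np.concatenate([v, v_ref], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v_all, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))
    k, v = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    k_ref, v_ref = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
    out, w = attention_with_kv_injection(q, k, v, k_ref, v_ref)
    print(out.shape, w.shape)
```

Under this assumption the queries are untouched, so the only new trainable pieces would be the projections that produce `k_ref` and `v_ref`, which is consistent with the "lightweight" description but is not confirmed by the source.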
📝 Abstract
Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable unless a model learns to separate them. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-identity reenactment reduces to a coefficient substitution at inference, requiring no cross-identity training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1,496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression follow the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.
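The coefficient substitution described above can be illustrated with a short sketch. It assumes a parametric face model that exposes separate identity, expression, and pose coefficient vectors; the class `FaceParams`, the function `reenact`, and the vector sizes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame face-model coefficients (names and sizes assumed)."""
    identity: np.ndarray    # shape coefficients, fixed per person
    expression: np.ndarray  # expression coefficients, vary per frame
    pose: np.ndarray        # head pose, e.g. rotation and translation


def reenact(reference: FaceParams, driver_frames: list[FaceParams]) -> list[FaceParams]:
    """Keep the reference identity; take expression and pose from each driver frame.

    Because the identity axis is assumed orthogonal to expression and pose,
    no learned mapping is involved: the swap is a plain substitution.
    """
    return [
        FaceParams(
            identity=reference.identity.copy(),
            expression=frame.expression.copy(),
            pose=frame.pose.copy(),
        )
        for frame in driver_frames
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = FaceParams(rng.normal(size=100), np.zeros(50), np.zeros(6))
    driver = [
        FaceParams(rng.normal(size=100), rng.normal(size=50), rng.normal(size=6))
        for _ in range(3)
    ]
    frames = reenact(reference, driver)
    print(len(frames), frames[0].identity.shape, frames[0].expression.shape)
```

In the pipeline the abstract describes, each substituted parameter set would then be rasterised into the spatial conditioning map; that rasterisation step is outside the scope of this sketch.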