🤖 AI Summary
Existing single-image 3D face reconstruction methods typically embed emotion implicitly within geometry or appearance, making it difficult to achieve consistent emotional control across identities. This work proposes a dual-path modulation mechanism that, without altering existing feed-forward architectures, introduces emotion into the reconstruction pipeline as an explicit, independent control signal for the first time. The approach combines geometric modulation, which applies emotion-conditioned normalization, with appearance modulation, which captures identity-aware emotional visual cues; together they disentangle emotion from speech-driven facial dynamics and enable cross-identity emotion transfer. Integrated into multiple state-of-the-art backbone networks, the method maintains high-fidelity reconstruction and reenactment while supporting controllable emotion transfer, smooth interpolation, and disentangled manipulation.
📝 Abstract
We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.
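To make "emotion-conditioned normalization in the original parametric space" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a generic adaptive-normalization (FiLM-style) formulation in which an expression parameter vector is normalized and then scaled and shifted by values predicted from an emotion code. The function names, the 6-dimensional parameter vector, the 3-dimensional emotion code, and the random weights are all hypothetical stand-ins for learned components. The sketch shows only the conditioning mechanism; the disentanglement from speech-driven articulation described in the abstract would depend on training, which is not modeled here.

```python
import math
import random


def normalize(params, eps=1e-5):
    """Zero-mean, unit-variance normalization over one parameter vector."""
    mean = sum(params) / len(params)
    var = sum((p - mean) ** 2 for p in params) / len(params)
    return [(p - mean) / math.sqrt(var + eps) for p in params]


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def emotion_conditioned_norm(params, emotion, w_gamma, b_gamma, w_beta, b_beta):
    """Normalize `params`, then apply a scale and shift predicted from `emotion`.

    Assumed form: out = (1 + gamma(e)) * norm(params) + beta(e).
    """
    gamma = linear(w_gamma, b_gamma, emotion)
    beta = linear(w_beta, b_beta, emotion)
    normed = normalize(params)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]


def lerp(a, b, t):
    """Linear interpolation between two emotion codes."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def random_matrix(n_out, n_in, rng, scale=0.1):
    """Random weights standing in for learned projection matrices."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


rng = random.Random(0)
N_PARAMS, N_EMO = 6, 3  # hypothetical sizes
w_gamma = random_matrix(N_PARAMS, N_EMO, rng)
w_beta = random_matrix(N_PARAMS, N_EMO, rng)
b_gamma = [0.0] * N_PARAMS
b_beta = [0.0] * N_PARAMS

expression = [0.2, -0.1, 0.4, 0.0, 0.3, -0.2]  # hypothetical expression parameters
neutral = [1.0, 0.0, 0.0]                      # hypothetical one-hot emotion codes
happy = [0.0, 1.0, 0.0]

# Sweep the emotion code from neutral to happy while keeping the input fixed.
for t in (0.0, 0.5, 1.0):
    emotion = lerp(neutral, happy, t)
    out = emotion_conditioned_norm(expression, emotion,
                                   w_gamma, b_gamma, w_beta, b_beta)
    print(t, [round(v, 3) for v in out])
```

Because the scale and shift in this sketch are affine functions of the emotion code, interpolating between two codes moves the modulated parameters along a straight line between the two endpoint outputs. That is one simple way a model could obtain the smooth emotion interpolation mentioned above; the actual framework may use a different, nonlinear mapping.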