CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of precisely controlling eye-region dynamics in portrait animation while keeping the input simple, particularly for beyond-emotion states such as thinking or drowsiness, which existing methods struggle to capture. The authors propose a two-stage framework: first, three chain-of-thought multimodal large language model (MLLM) agents translate high-level semantic labels into temporally coherent and physiologically plausible facial keypoints; second, a DiT-based video generation backbone synthesizes the final animation, conditioned on the keypoints, reference portrait, audio, and text prompt, using a dynamic classifier-free guidance strategy. Key contributions include hierarchical agent-driven fine-grained eye motion control, joint semantic-physiological constraints, and an eye-region-aware reweighting mechanism. The study also introduces EMH, a benchmark covering both emotional and beyond-emotion states, with two AU-level metrics for eye-region and head-motion control. Experiments on HDTF and EMH show more precise eye-region control than existing methods while preserving visual quality and identity consistency.

📝 Abstract

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units (AUs) or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations remain restrictive for beyond-emotion states (e.g., thinking and drowsiness). To address this, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark, which covers diverse emotions and beyond-emotion categories and provides two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.
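The abstract does not give the formula behind the eye-region-aware reweighting, so the following is only a minimal sketch of one plausible reading: standard classifier-free guidance, `uncond + s * (cond - uncond)`, with the scale `s` raised inside an eye-region mask. The function name `eye_weighted_cfg`, the parameters `base_scale` and `eye_boost`, and the additive form of the boost are all assumptions made for illustration, not the authors' method. The "dynamic" schedule over denoising steps is omitted because the source does not describe it.

```python
from typing import List

Grid = List[List[float]]  # toy 2D stand-in for a model prediction


def eye_weighted_cfg(
    cond: Grid,
    uncond: Grid,
    eye_mask: Grid,
    base_scale: float = 3.0,  # hypothetical default guidance scale
    eye_boost: float = 2.0,   # hypothetical extra scale inside the eye region
) -> Grid:
    """Classifier-free guidance with a per-position guidance scale.

    Standard CFG computes uncond + s * (cond - uncond) with one global s.
    Here s = base_scale + eye_boost * mask, where mask is 1.0 inside the
    (assumed) eye region and 0.0 elsewhere.
    """
    guided: Grid = []
    for c_row, u_row, m_row in zip(cond, uncond, eye_mask):
        row = []
        for c, u, m in zip(c_row, u_row, m_row):
            scale = base_scale + eye_boost * m
            row.append(u + scale * (c - u))
        guided.append(row)
    return guided


# Toy 2x2 example: only the top-left position is marked as eye region.
cond = [[1.0, 1.0], [1.0, 1.0]]
uncond = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]
guided = eye_weighted_cfg(cond, uncond, mask)
```

With an all-zero mask the function reduces to ordinary classifier-free guidance, which makes the reweighting easy to ablate in this sketch.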

Problem

Research questions and friction points this paper is trying to address.

portrait animation

eye-region control

fine-grained manipulation

beyond-emotion states

oculomotor dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Agent Planning

Fine-Grained Eye Control

Multimodal Large Language Models (MLLMs)