🤖 AI Summary
This work addresses the lack of fine-grained emotional control in talking-head generation, which often results in facial expressions lacking realism. To this end, we propose AUHead, a two-stage emotion-controllable generation framework. First, a large audio-language model disentangles fine-grained Action Units (AUs) from speech using spatiotemporal AU tokens and an emotion-to-AU chain-of-thought reasoning process. Subsequently, an AU-driven controllable diffusion model maps the AU sequence to a structured 2D facial representation, leveraging cross-attention mechanisms to model interactions between AUs and visual features. Our approach is the first to employ AUs as an intermediate representation for disentangling emotional cues from speech, and it introduces an AU-disentanglement guidance strategy that enhances emotional expressiveness while preserving identity consistency. Experiments demonstrate that our method significantly outperforms existing techniques in emotional authenticity, lip-sync accuracy, and visual coherence.
📝 Abstract
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) that disentangles fine-grained emotion controls, i.e., Action Units (AUs), from audio and achieves controllable generation. In the first stage, we explore the AU-generation abilities of large audio-language models (ALMs) via spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism, which disentangles AUs from raw speech and effectively captures subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into a structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To enable a flexible trade-off between AU control and visual quality, we introduce an AU-disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves strong performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR
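The "flexible AU-quality trade-off" at inference reads like a classifier-free-guidance-style blend of an AU-conditioned and an unconditional denoiser prediction. A minimal sketch of that idea follows; the function name `au_guidance`, the linear weighting, and the tensor shapes are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def au_guidance(eps_uncond: np.ndarray, eps_au: np.ndarray, scale: float) -> np.ndarray:
    """Blend unconditional and AU-conditioned noise predictions.

    Classifier-free-guidance-style extrapolation (assumed form, not the
    paper's exact rule): a larger `scale` pushes samples toward the AU
    condition (stronger emotion) at a possible cost in visual quality.
    """
    return eps_uncond + scale * (eps_au - eps_uncond)

# Toy usage with dummy noise predictions for a 4x4 latent.
eps_u = np.zeros((1, 4, 4))   # prediction without AU conditioning
eps_c = np.ones((1, 4, 4))    # prediction with AU conditioning
out = au_guidance(eps_u, eps_c, scale=2.0)
```

With `scale = 0` the output ignores the AU condition entirely, and `scale = 1` recovers the conditional prediction; values above 1 exaggerate the AU-driven expression.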