🤖 AI Summary
This work addresses the limitations of existing interactive head generation methods, which often produce contextually inappropriate and emotionally implausible facial behaviors because they lack long-range contextual modeling, and whose entangled dual-track signal fusion frequently compromises lip-sync accuracy. To overcome these challenges, the authors propose ECHO, a novel framework that enhances contextual and emotional coherence through a Long-range Contextual Understanding (LCU) component. ECHO further introduces a Spatial-aware Decoupled Cross-attention Modulation (SDCM) mechanism that effectively integrates user behavioral cues without degrading lip-synchronization fidelity. Coupled with a two-stage training strategy, the approach significantly outperforms state-of-the-art methods in lip-sync accuracy, visual fidelity, and contextual-emotional plausibility.
📝 Abstract
In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head videos that emulate such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving the generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that captures both behavior-grounded dynamics and linguistically driven affective semantics to promote the contextual appropriateness and emotional rationality of the synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that preserves self-audio-driven lip articulation while adaptively integrating the user's contextual behavioral cues into non-lip facial regions. Together with our two-stage training paradigm, these components jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.
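To make the decoupling idea behind SDCM concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' released code) of one block-wise decoupled cross-attention step: a per-token lip-region mask routes the audio-conditioned update to lip tokens and the user-context update to the remaining facial tokens, so the two conditioning tracks never mix. The module names, tensor shapes, and hard binary masking scheme are all assumptions for illustration.

```python
# Hypothetical sketch of spatial-aware decoupled cross-attention;
# shapes, names, and masking are illustrative assumptions, not the
# paper's actual implementation.
import torch
import torch.nn as nn

class DecoupledCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Separate cross-attention paths for the two conditioning tracks,
        # so user-context cues cannot interfere with audio-lip alignment.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face_tokens, audio_feats, context_feats, lip_mask):
        # face_tokens:   (B, N, D) spatial tokens of the avatar face
        # audio_feats:   (B, Ta, D) avatar speech features
        # context_feats: (B, Tc, D) long-range user behavioral context
        # lip_mask:      (B, N, 1) 1.0 for lip-region tokens, else 0.0
        q = self.norm(face_tokens)
        audio_out, _ = self.audio_attn(q, audio_feats, audio_feats)
        ctx_out, _ = self.context_attn(q, context_feats, context_feats)
        # Lip tokens are updated only by the audio path; non-lip tokens
        # receive only the user-context modulation.
        update = lip_mask * audio_out + (1.0 - lip_mask) * ctx_out
        return face_tokens + update

if __name__ == "__main__":
    B, N, Ta, Tc, D = 2, 256, 40, 120, 512
    block = DecoupledCrossAttentionBlock(D)
    face = torch.randn(B, N, D)
    audio = torch.randn(B, Ta, D)
    ctx = torch.randn(B, Tc, D)
    mask = (torch.rand(B, N, 1) < 0.2).float()  # ~20% of tokens as lip region
    print(block(face, audio, ctx, mask).shape)  # torch.Size([2, 256, 512])
```

Under this reading, the hard spatial routing is what lets user behavioral context shape expressions and head dynamics while leaving the audio-to-lip pathway untouched, which is the interference the abstract attributes to entangled, role-agnostic fusion.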