🤖 AI Summary
Existing video generation methods predominantly model isolated actions, failing to capture critical dynamic hand–face interactions essential for biometric anti-spoofing. To address this gap, we propose the first systematic framework for high-fidelity hand–face interaction animation generation. Our method introduces a region-aware diffusion model incorporating learnable spatiotemporal latent variables and a dynamic interaction prior, jointly optimized with physics-based contact modeling and anatomically plausible facial deformation. We further construct InterHF—the first large-scale hand–face interaction dataset—comprising 90,000 videos across 18 interaction patterns. Extensive experiments demonstrate that our approach significantly outperforms existing baselines in visual realism, temporal coherence, and anatomical plausibility. This work establishes a new benchmark for hand–face interaction animation. Both code and the InterHF dataset will be publicly released.
📝 Abstract
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand–face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand–face interactions. Our approach simultaneously learns spatiotemporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand–face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand–face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
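The abstract describes a region-aware mechanism that injects learned interaction priors into the denoising process, but gives no implementation details. As a rough illustration only, the sketch below shows one plausible form of such an injection: denoiser features cross-attend to a small bank of learnable latents, and the resulting update is gated by a hand–face region mask so tokens outside the interaction region are untouched. All function names, shapes, and the masking scheme are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_aware_injection(features, latents, region_mask):
    """Hypothetical region-aware prior injection (illustrative, not the paper's code).

    features:    (N, d) flattened spatial tokens from a denoising step
    latents:     (M, d) learnable interaction-prior latents
    region_mask: (N,)   1.0 inside the hand-face interaction region, 0.0 elsewhere
    """
    d = features.shape[-1]
    # Cross-attention from feature tokens to the latent bank.
    attn = softmax(features @ latents.T / np.sqrt(d))   # (N, M)
    update = attn @ latents                              # (N, d)
    # Gate the residual update so only in-region tokens receive the prior.
    return features + region_mask[:, None] * update

# Toy usage with random data.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))   # 16 spatial tokens, dim 8
lats = rng.standard_normal((4, 8))     # 4 learnable latents
mask = np.zeros(16)
mask[:6] = 1.0                         # first 6 tokens lie in the interaction region
out = region_aware_injection(feats, lats, mask)
```

The residual, mask-gated form is one natural way to inject a prior without perturbing regions the interaction does not touch; the paper's actual mechanism may differ.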