Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of generating high-fidelity portrait animations driven jointly by audio and skeletal motion. The proposed method achieves precise speech–lip synchronization, natural facial expression modeling, and faithful dynamic body motion reproduction. Its key contributions are: (1) the first direct preference optimization framework tailored for portrait animation, explicitly aligning model outputs with human perceptual preferences; (2) a temporal motion modulation module that mitigates spatiotemporal resolution mismatch via channel-wise temporal redistribution and scale-aware feature expansion; and (3) a diffusion-based architecture (UNet/DiT) integrating latent-space motion conditioning with high-dimensional temporal feature modeling. Experiments demonstrate significant improvements over baselines in lip-sync accuracy, facial expressiveness, and body motion coherence. Human preference evaluations across multiple metrics—e.g., naturalness, synchrony, and motion fidelity—show substantial gains, validating the effectiveness of the approach.

📝 Abstract
Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
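The preference-alignment contribution builds on the standard direct preference optimization (DPO) objective, which trains the model so that preferred samples score higher than rejected ones relative to a frozen reference model. A minimal numerical sketch of that pairwise loss is below; the function name and scalar log-likelihood inputs are illustrative assumptions, and the paper's diffusion-specific, per-timestep formulation is not reproduced here.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO objective on a preferred (w) / rejected (l) pair.

    logp_*     : log-likelihoods of the two samples under the current model.
    ref_logp_* : the same quantities under the frozen reference model.
    beta       : temperature controlling deviation from the reference.
    """
    # Implicit reward margin between the preferred and rejected samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Maximizing sigmoid(margin) is minimizing -log(sigmoid(margin)).
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

With a zero margin the loss sits at log 2; as the current model assigns relatively more likelihood to the human-preferred animation, the margin grows and the loss falls below that baseline.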
Problem

Research questions and friction points this paper is trying to address.

Achieving precise lip sync in audio-driven portrait animation
Generating natural facial expressions for photorealistic animations
Preserving high-fidelity body motion dynamics in synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct preference optimization for human-centric animation
Temporal motion modulation for spatiotemporal resolution alignment
Complementary to UNet and DiT-based diffusion approaches
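The temporal motion modulation idea of "temporal channel redistribution" can be sketched as a pixel-shuffle-style reshape along the time axis: a factor of the motion sequence's temporal resolution is folded into the channel dimension so the condition matches the temporally compressed latent grid of the diffusion backbone. The exact axis ordering and the companion scale-aware feature expansion in the paper are not specified here, so this is one plausible layout, not the authors' implementation.

```python
import numpy as np

def temporal_channel_redistribution(motion, r):
    """Fold a factor r of the temporal axis into channels.

    motion : array of shape (C, T, H, W), with T divisible by r.
    Returns an array of shape (C * r, T // r, H, W), aligning a dense
    motion sequence with a temporally compressed latent feature grid.
    """
    c, t, h, w = motion.shape
    assert t % r == 0, "temporal length must divide the compression rate"
    x = motion.reshape(c, t // r, r, h, w)   # split time into (T//r, r)
    x = x.transpose(2, 0, 1, 3, 4)           # move the factor r ahead of C
    return x.reshape(c * r, t // r, h, w)    # merge it into the channel axis
```

Because the operation is a pure reshape, no high-frequency temporal information is discarded, which matches the stated goal of preserving fine motion detail; an inverse reshape recovers the original sequence exactly.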
Jiahao Cui
Huazhong University of Science and Technology
Computer Vision · Deep Learning · Computational Photography
Yan Chen
Fudan University
Mingwang Xu
Fudan University
Hanlin Shang
Fudan University
Yuxuan Chen
Fudan University
Yun Zhan
Fudan University
Zilong Dong
Institute for Intelligent Computing, Alibaba Group
NeRF · 3D Human · 3D Generation · 3D Understanding
Yao Yao
Nanjing University
Jingdong Wang
Baidu Inc.
Siyu Zhu