AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of simultaneously achieving motion naturalness and visual fidelity in audio-driven human animation. The authors propose AlignHuman, whose core innovation, Timestep-Segment Preference Optimization (TPO), decouples the diffusion denoising process into two phases: an early phase governing motion dynamics and a late phase emphasizing structural fidelity. To enable joint optimization, they introduce two specialized LoRAs as expert alignment modules, each trained on its own preference data and activated only within its timestep interval. The method integrates diffusion modeling, preference learning, low-rank adaptation (LoRA), and audio-motion alignment training. Evaluated against strong baselines, TPO achieves significant improvements in both motion naturalness and visual quality. Notably, inference cost is reduced from 100 to 30 NFEs (a 3.3× speedup) with near-lossless generation quality.
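The timestep-interval gating described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the split point (`FIDELITY_FRACTION`), the expert names, and the mock sampling loop are all assumptions introduced here for clarity.

```python
# Hedged sketch of timestep-segment expert gating, as described in the summary:
# early (high-noise) denoising steps activate a motion-oriented LoRA, while
# late steps activate a fidelity-oriented LoRA. The split fraction and names
# below are illustrative assumptions, not values from the paper.

FIDELITY_FRACTION = 0.4  # assumed share of steps forming the late interval


def active_expert(step: int, total_steps: int) -> str:
    """Select which LoRA expert is active at a given denoising step.

    Steps run from high noise (step 0) toward the clean sample; the early
    segment shapes motion dynamics, the late segment refines structure.
    """
    progress = step / total_steps
    if progress < 1.0 - FIDELITY_FRACTION:
        return "motion_lora"
    return "fidelity_lora"


def expert_schedule(total_steps: int = 30) -> list[str]:
    """Mock denoising loop: record the expert active at each step."""
    return [active_expert(t, total_steps) for t in range(total_steps)]
```

With 30 NFEs, the first 60% of steps would use the motion expert and the remainder the fidelity expert; in the real system the gated LoRA weights would be applied to the diffusion backbone rather than merely logged.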

📝 Abstract
Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose **AlignHuman**, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3× speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: https://alignhuman.github.io/
Problem

Research questions and friction points this paper is trying to address.

Balancing motion naturalness and visual fidelity in human animation
Optimizing denoising timesteps for motion dynamics and fidelity
Reducing computational cost while maintaining animation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timestep-segment preference optimization for motion and fidelity
Two specialized LoRAs as expert alignment modules
Divide-and-conquer training strategy for joint optimization
👥 Authors
Chao Liang (ByteDance)
Jianwen Jiang (ByteDance)
Wang Liao (ByteDance)
Jiaqi Yang
Zerong Zheng (ByteDance)
Weihong Zeng
Han Liang

Fields: Computer Vision, Computer Graphics