McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-video (T2V) generation suffers from misalignment with human preferences: existing approaches rely on manual annotations or proxy metrics and neglect the intrinsic multidimensionality of preference (e.g., trade-offs between motion dynamics and visual fidelity), which biases models toward low-motion content. To address this, the authors propose McSc, a three-stage reinforcement learning framework that integrates self-critic dimensional reasoning and hierarchical comparative reasoning. Its core innovation is a motion-corrective preference optimization strategy that enables fine-grained, dynamic modeling of multidimensional preferences. Technically, McSc unifies a generative reward model, self-critic reasoning chains, hierarchical reward supervision, and dynamically re-weighted direct preference optimization, alongside a structured multidimensional evaluation protocol. Experiments demonstrate substantial improvements in human preference alignment: McSc preserves high visual quality while significantly enhancing motion richness and temporal coherence.

📝 Abstract
Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preferences remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or use proxy metrics to predict preference, which lack an understanding of the logic behind human preferences. Moreover, they usually align T2V models directly with the overall preference distribution, ignoring potentially conflicting dimensions such as motion dynamics and visual quality, which may bias models toward low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. First, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Second, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structured multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, dynamically re-weighting the alignment objective to mitigate bias toward low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high motion dynamics.
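The paper's exact McDPO objective is not reproduced on this page. As a minimal sketch of the general idea, the following shows a standard DPO pairwise loss whose per-pair weight depends on the motion gap between the chosen and rejected video. All names (`motion_weight`, `mcdpo_pair_loss`, the `gamma` scale) and the sigmoid weighting scheme are hypothetical illustrations, not the authors' formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def motion_weight(motion_w, motion_l, gamma=1.0):
    # Hypothetical re-weighting: emphasize pairs where the preferred
    # video is at least as dynamic as the rejected one, damping the
    # gradient of pairs that would push the model toward static content.
    return sigmoid(gamma * (motion_w - motion_l))

def mcdpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    motion_w, motion_l, beta=0.1, gamma=1.0):
    # Standard DPO margin on policy-vs-reference log-ratios of the
    # chosen (w) and rejected (l) videos.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # The motion-dependent weight scales the pairwise loss term.
    return -motion_weight(motion_w, motion_l, gamma) * math.log(sigmoid(margin))
```

Under this sketch, a pair whose preferred video is more dynamic (`motion_w > motion_l`) contributes more to the loss than one where the static video won, which captures the stated goal of mitigating low-motion bias.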
Problem

Research questions and friction points this paper is trying to address.

Aligning synthesized videos with nuanced human preferences
Resolving conflicts between motion dynamics and visual quality
Reducing bias toward low-motion content in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-critic hierarchical reasoning for preference decomposition
Hierarchical comparative reasoning for multi-dimensional video assessment
Motion-corrective optimization to reduce low-motion bias
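The hierarchical comparative reasoning stage compares two videos per preference dimension before aggregating to an overall verdict. The paper's actual reasoning is chain-of-thought inside a generative reward model; the sketch below is only a numeric stand-in, assuming hypothetical dimension scores and weights, to illustrate the per-dimension-then-aggregate structure.

```python
def hierarchical_preference(scores_a, scores_b, weights):
    """Compare two videos dimension by dimension, then aggregate.

    scores_a / scores_b: dict mapping dimension name -> score
    weights: dict mapping dimension name -> importance weight
    (all values hypothetical; the paper uses a generative RM instead)
    """
    # Per-dimension verdict: +1 if A wins, -1 if B wins, 0 on a tie.
    verdicts = {d: (scores_a[d] > scores_b[d]) - (scores_a[d] < scores_b[d])
                for d in weights}
    # Weighted aggregation of the per-dimension verdicts into a
    # single holistic preference.
    total = sum(weights[d] * verdicts[d] for d in weights)
    return "A" if total > 0 else "B" if total < 0 else "tie"
```

For example, a video that wins on motion but narrowly loses on visual quality can still be preferred overall if the motion dimension carries more weight, which is exactly the trade-off the motion-corrective stage is designed to handle.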