🤖 AI Summary
Existing metrics for evaluating video generation struggle to capture fine-grained human details and align poorly with human subjective preferences. This work proposes HuM-Eval, a coarse-to-fine, human-centric evaluation framework: a vision-language model first assesses global video quality, and 2D pose estimation and 3D human motion analysis then evaluate the anatomical correctness and motion stability of human figures. The method introduces a hierarchical coarse-to-fine strategy into human video assessment for the first time, yielding a perceptually aligned framework that jointly considers global semantics and local details. The authors also release HuM-Bench, a benchmark comprising 1,000 diverse text prompts. Experiments show that the approach achieves an average correlation of 58.2% with human preferences, outperforming current state-of-the-art methods, and offers a systematic, reliable framework for human-centric evaluation of text-to-video generation.
📝 Abstract
Video generation models have advanced rapidly in recent years, and generating natural human motion plays a pivotal role in their output. However, accurately evaluating the quality of generated human motion videos remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preferences. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first uses a vision-language model (VLM) to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average correlation of 58.2% with human judgments, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.
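To make the coarse-to-fine idea concrete, the following is a minimal Python sketch of how a coarse global score could be combined with two fine-grained checks. It is not the authors' implementation. The joint names, the limb-symmetry heuristic for anatomical correctness, the acceleration-based jitter measure for motion stability, the gating threshold, and the equal-weight averaging are all assumptions chosen for illustration; the abstract does not specify any of them. The coarse score is assumed to come from a VLM and to lie in [0, 1].

```python
import math
from typing import Sequence

Point2D = tuple[float, float]
Point3D = tuple[float, float, float]

# Hypothetical symmetric limb pairs, each limb given as (joint_a, joint_b).
SYMMETRIC_LIMBS = [
    (("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow")),
    (("l_hip", "l_knee"), ("r_hip", "r_knee")),
]


def anatomy_score(poses_2d: Sequence[dict[str, Point2D]],
                  tolerance: float = 0.5) -> float:
    """Fraction of frames whose left/right limb lengths roughly agree.

    A toy stand-in for an anatomical check on 2D poses (assumption).
    """
    if not poses_2d:
        return 0.0
    plausible_frames = 0
    for pose in poses_2d:
        plausible = True
        for (a1, b1), (a2, b2) in SYMMETRIC_LIMBS:
            left = math.dist(pose[a1], pose[b1])
            right = math.dist(pose[a2], pose[b2])
            longer = max(left, right)
            # A collapsed limb or a large left/right mismatch is implausible.
            if longer == 0 or abs(left - right) / longer > tolerance:
                plausible = False
                break
        plausible_frames += plausible
    return plausible_frames / len(poses_2d)


def stability_score(poses_3d: Sequence[Sequence[Point3D]]) -> float:
    """Map mean joint acceleration magnitude to (0, 1]; 1.0 means no jitter.

    Acceleration is the second finite difference of joint positions
    across three consecutive frames (assumption).
    """
    total, count = 0.0, 0
    for prev, cur, nxt in zip(poses_3d, poses_3d[1:], poses_3d[2:]):
        for p, c, n in zip(prev, cur, nxt):
            accel = [n[i] - 2 * c[i] + p[i] for i in range(3)]
            total += math.hypot(*accel)
            count += 1
    if count == 0:  # fewer than three frames, or no joints
        return 1.0
    return 1.0 / (1.0 + total / count)


def coarse_to_fine_score(coarse: float,
                         poses_2d: Sequence[dict[str, Point2D]],
                         poses_3d: Sequence[Sequence[Point3D]],
                         gate: float = 0.5) -> float:
    """Combine a coarse VLM score with the fine-grained human checks.

    Videos that fail the coarse stage keep their coarse score; the rest
    are refined by averaging in the fine-grained score (assumption).
    """
    if coarse < gate:
        return coarse
    fine = 0.5 * (anatomy_score(poses_2d) + stability_score(poses_3d))
    return 0.5 * (coarse + fine)
```

In this sketch the coarse stage acts as a gate, so the pose-based checks only run on videos whose global quality is already acceptable. Whether HuM-Eval gates, multiplies, or otherwise fuses its stages is not stated in the abstract.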