Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

πŸ“… 2025-12-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing video generation evaluation metrics are susceptible to appearance bias and lack the capacity to model the temporal dynamics and anatomical plausibility of complex human motion. This paper proposes the first latent-space evaluation metric that jointly leverages appearance-agnostic skeletal geometric features and appearance features. By integrating multimodal feature alignment, temporal modeling, and latent-space distance measurement, our method jointly assesses motion plausibility and temporal coherence. Crucially, it explicitly incorporates skeletal motion modeling into video fidelity assessment, a novel departure from purely visual or multimodal large language model (MLLM)-based approaches, which suffer from limited temporal reasoning and anatomical consistency. Evaluated on a newly constructed, deliberately challenging benchmark, our method outperforms state-of-the-art methods by over 68%; it also performs competitively on established external benchmarks and correlates more strongly with human perceptual judgments.
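As a concrete illustration of the fusion step, the following is a minimal sketch, assuming per-frame 2D joint coordinates from an off-the-shelf pose estimator and per-frame appearance embeddings from a frozen video encoder. The bone topology, the velocity-based temporal features, and all function names are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def skeletal_geometry_features(joints):
    """Appearance-agnostic geometry: bone vectors normalized by torso length.

    joints: (T, J, 2) array of per-frame 2D joint coordinates.
    Returns a (T, F) array of per-frame geometric features plus velocities.
    """
    # Hypothetical kinematic tree: (parent, child) joint index pairs.
    BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7)]
    # Scale-normalize by the hip-to-neck (torso) length so that subject
    # size and camera framing do not affect the features.
    torso = np.linalg.norm(joints[:, 1] - joints[:, 0], axis=-1, keepdims=True) + 1e-8
    vecs = [(joints[:, c] - joints[:, p]) / torso for p, c in BONES]
    feats = np.concatenate(vecs, axis=-1)          # (T, 2 * len(BONES))
    # First-order differences capture per-joint motion dynamics over time.
    velocity = np.diff(feats, axis=0, prepend=feats[:1])
    return np.concatenate([feats, velocity], axis=-1)

def fuse(skeleton_feats, appearance_feats):
    """Concatenate L2-normalized per-frame features from both modalities."""
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    return np.concatenate([l2norm(skeleton_feats), l2norm(appearance_feats)], axis=-1)
```

Normalizing bone vectors by torso length keeps the geometric features invariant to subject scale and framing, which is one plausible way to realize "appearance-agnostic" skeletal features.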

πŸ“ Abstract
Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and correlates more strongly with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
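To make the distance measurement concrete, here is a minimal sketch of one plausible instantiation: fit Gaussian statistics to pooled latents of real action videos and score a generated video by its Mahalanobis distance. The paper does not specify this exact form; a FrΓ©chet-style distance between sets of videos would be a close alternative.

```python
import numpy as np

def fit_real_distribution(real_latents):
    """Fit Gaussian statistics (mean, inverse covariance) to the pooled
    latents of real action videos; real_latents has shape (N, D)."""
    mu = real_latents.mean(axis=0)
    # A small ridge keeps the covariance invertible in high dimensions.
    cov = np.cov(real_latents, rowvar=False) + 1e-6 * np.eye(real_latents.shape[1])
    return mu, np.linalg.inv(cov)

def action_score(video_latent, mu, cov_inv):
    """Mahalanobis distance of one pooled video latent to the real-action
    distribution; smaller means more plausible motion."""
    d = video_latent - mu
    return float(np.sqrt(d @ cov_inv @ d))
```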
Problem

Research questions and friction points this paper is trying to address.

How to evaluate the visual and temporal correctness of generated human actions
How to overcome the appearance bias and lack of temporal understanding in current metrics
How to measure action plausibility against real-world motion distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses skeletal geometry with appearance features
Learns latent space from real-world human actions
Measures distance to the real-world action distribution (see the end-to-end sketch below)
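Tying the hypothetical pieces above together, the toy run below scores one clip end to end; the random arrays stand in for a real pose estimator, appearance encoder, and reference corpus, and the helper functions are the sketches defined earlier.

```python
import numpy as np

# Random stand-ins for a pose estimator, an appearance encoder, and a
# corpus of real-action latents; all purely illustrative.
rng = np.random.default_rng(0)
T, J, D_app = 32, 8, 64
joints = rng.normal(size=(T, J, 2))            # fake per-frame joints
appearance = rng.normal(size=(T, D_app))       # fake appearance embeddings

skel = skeletal_geometry_features(joints)      # (T, F_s)
fused = fuse(skel, appearance).mean(axis=0)    # temporal pooling -> (D,)

real = rng.normal(size=(200, fused.shape[0]))  # fake real-action latents
mu, cov_inv = fit_real_distribution(real)
print("action score:", action_score(fused, mu, cov_inv))
```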