🤖 AI Summary
Existing video generation evaluation benchmarks (e.g., TikTok, TED-Talks) lack fine-grained annotations of human motion and interaction, hindering systematic assessment of controllable image-to-video models. To address this, we introduce WYD—the first benchmark dedicated to controllable human video generation—comprising 1,544 manually annotated videos with 56 action/interaction categories. We propose the first structured evaluation framework specifically designed for human video generation, featuring nine automated, interpretable metrics spanning action fidelity, interaction plausibility, and motion coherence; these metrics achieve strong correlation with human judgments (ρ > 0.89). Crucially, semantic alignment-based consistency measures for action, interaction, and motion significantly enhance evaluation reliability. Comprehensive evaluation across seven state-of-the-art models uncovers critical limitations in controllability and physical plausibility. Our work establishes a reproducible, multi-dimensional, and fine-grained evaluation paradigm to advance research in controllable human video synthesis.
📝 Abstract
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing "What Are You Doing?" (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions, and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
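Validating automatic metrics against human judgments, as reported above (ρ > 0.89), is typically done with a rank correlation such as Spearman's ρ. The paper's exact protocol is not described here, so the sketch below is illustrative only: a self-contained Spearman implementation applied to hypothetical per-video scores, not data from WYD.

```python
# Illustrative sketch: correlating an automatic metric with human ratings
# via Spearman's rho (rank correlation). All data below is made up.

def rank(values):
    """Assign 1-based average ranks to values (ties share their mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the tie group starting at i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical per-video scores: automatic metric vs. 1-5 human ratings.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.88]
human_ratings = [5, 1, 4, 3, 5]
rho = spearman(metric_scores, human_ratings)
```

A rank correlation is a natural choice here because it only asks whether the metric *orders* generations the same way humans do, and is insensitive to the metric's scale or calibration.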