Human detectors are surprisingly powerful reward models

📅 2026-01-15

🤖 AI Summary

Existing video generation models often produce missing limbs, distorted poses, or physically implausible motion when synthesizing complex non-rigid human actions such as dance or sports. This work proposes HuDA, a reward model that uses the confidence scores of off-the-shelf human detectors as an efficient reward signal requiring no fine-tuning, an insight not previously explored. Combined with a temporal prompt alignment score, HuDA guides post-training of generative models via Group Relative Policy Optimization (GRPO). Requiring no manual annotations, the approach achieves a 73% win rate against state-of-the-art models such as Wan 2.1 on complex human motion generation and significantly improves video quality in broader scenarios, including animal motion and human-object interactions.

📝 Abstract

Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports and dance. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality and a temporal prompt alignment score to capture motion realism. We show that this simple reward function, which leverages off-the-shelf models without any additional training, outperforms specialized models fine-tuned with manually annotated data. Using HuDA for Group Relative Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1 with a win rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.
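
To make the two-part reward concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state how the detection confidence and the temporal prompt alignment score are combined, so the weighted average, the `weight` parameter, and the function names `huda_style_reward` and `group_relative_advantages` are all assumptions made for illustration. The second function shows the group-normalized advantage commonly used in GRPO, where each sample's reward is standardized against the other samples generated for the same prompt.

```python
from statistics import mean, pstdev


def huda_style_reward(frame_confidences, alignment_score, weight=0.5):
    """Hypothetical combination of the two signals named in the abstract.

    frame_confidences: per-frame human-detector confidences in [0, 1]
                       (use 0.0 for frames where no human is detected).
    alignment_score:   temporal prompt alignment score in [0, 1].
    weight:            assumed mixing weight; not taken from the paper.
    """
    # Appearance term: average detector confidence across frames.
    appearance = mean(frame_confidences) if frame_confidences else 0.0
    # Assumed linear mix of appearance and motion/prompt alignment.
    return weight * appearance + (1.0 - weight) * alignment_score


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical videos sampled for one prompt.
    group = [
        huda_style_reward([0.9, 0.8, 0.85], alignment_score=0.7),
        huda_style_reward([0.4, 0.0, 0.3], alignment_score=0.6),
        huda_style_reward([0.7, 0.75, 0.6], alignment_score=0.5),
    ]
    print(group_relative_advantages(group))
```

Under these assumptions, a video whose frames lose the detected person (low or zero confidence) receives a below-average reward within its group and therefore a negative advantage, which is the behavior the post-training step would discourage.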

Problem

Research questions and friction points this paper is trying to address.

human motion

video generation

non-rigid motion

temporal coherence

visual fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

HuDA

reward modeling

video generation