GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models struggle to capture fine-grained temporal dynamics in video reward modeling, leading to weak alignment with human preferences. This work repurposes generative Transformer-based video models as reward models by recasting them as energy-based models (EBMs) trained with a contrastive objective. To enhance robustness and semantic awareness, the approach synthesizes realistic negative samples through controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling), compelling the model to focus on semantic spatiotemporal features rather than superficial artifacts. The method achieves state-of-the-art performance on GenAI-Bench and MonteBench while requiring only 30,000 human annotations, a 6–65× reduction in labeling effort compared to existing vision-language-model approaches.
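
As a concrete illustration of the EBM recasting, the sketch below puts a scalar energy head on top of a generic video-transformer backbone and trains it with a logistic contrastive loss, so real clips receive lower energy than perturbed negatives. The backbone interface, the mean pooling, and the loss form are assumptions for illustration; the paper's exact architecture and objective are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyHead(nn.Module):
    """Maps pooled video-transformer features to a scalar energy.

    `backbone` is a stand-in for any pretrained generative video
    Transformer that returns per-token features of shape (B, T, D);
    this interface is an assumption, not the paper's API.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(video)          # (B, T, D) token features
        pooled = feats.mean(dim=1)            # average-pool over time
        return self.head(pooled).squeeze(-1)  # (B,) scalar energies


def contrastive_ebm_loss(e_pos: torch.Tensor, e_neg: torch.Tensor) -> torch.Tensor:
    """Pairwise logistic (noise-contrastive) objective: minimized when
    e_pos << e_neg, i.e. high-quality videos get low energy and
    degraded ones get high energy."""
    return F.softplus(e_pos - e_neg).mean()

# Usage (hypothetical): negatives come from latent-space perturbations.
# loss = contrastive_ebm_loss(model(real_clips), model(perturbed_clips))
```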

📝 Abstract
Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling) that simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: 6× to 65× fewer than existing VLM-based approaches.
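
The three latent-space perturbations named in the abstract can be sketched as simple tensor operations. The snippet below assumes video latents of shape (T, D) (frames × channels); the function names, drop ratios, and window sizes are illustrative choices, not the paper's exact recipe.

```python
import torch

def temporal_slice(z: torch.Tensor, drop: float = 0.3) -> torch.Tensor:
    """Remove a contiguous run of latent frames and re-stitch the rest,
    breaking long-range temporal continuity. z: (T, D)."""
    t = z.size(0)
    cut = max(1, int(t * drop))
    start = torch.randint(0, t - cut + 1, (1,)).item()
    return torch.cat([z[:start], z[start + cut:]], dim=0)

def feature_swap(z_a: torch.Tensor, z_b: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Swap a random subset of latent channels between two clips,
    corrupting appearance while keeping motion superficially plausible."""
    mask = torch.rand(z_a.size(-1)) < p
    out = z_a.clone()
    out[..., mask] = z_b[..., mask]
    return out

def frame_shuffle(z: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Permute latent frames inside local windows, producing subtle
    temporal-order violations rather than obvious global scrambling."""
    out = z.clone()
    for s in range(0, z.size(0) - window + 1, window):
        perm = torch.randperm(window) + s
        out[s:s + window] = z[perm]
    return out
```

Because the perturbations act in latent space rather than pixel space, the resulting negatives decode to plausible-looking videos with degraded semantics, which is what keeps the contrastive objective from collapsing onto trivial artifacts.
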
Problem

Research questions and friction points this paper is trying to address.

video reward modeling
temporal dynamics
human preferences
generative models
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Transformer
Energy-Based Model
Self-Supervised Reward Modeling
Temporal Dynamics
Latent-Space Perturbation