VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

📅 2024-12-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the coarse-grained, opaque, and confounder-prone nature of human preference modeling in image and video generation, this paper proposes VisionReward, a fine-grained, multi-dimensional visual reward model. Methodologically, it introduces a novel disentangled preference modeling framework that structures human preferences into multiple interpretable dimensions, enabling unified assessment of images and videos via dynamic feature modeling and structured judgment queries. The approach integrates multi-dimensional linearly weighted scoring, temporal dynamic feature analysis for videos, and a multi-objective preference learning algorithm, trained on high-quality human-annotated data. Experiments demonstrate that VisionReward achieves a 17.2% improvement over VideoScore in video preference prediction and attains state-of-the-art performance in both automated metrics and human evaluations. The code and dataset are fully open-sourced.

๐Ÿ“ Abstract
We present a general strategy for aligning visual generation models -- both image and video generation -- with human preference. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.
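
The scoring idea in the abstract (judgment questions, linearly weighted and summed) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the question names, the weight values, the +1/-1 encoding of yes/no answers, and the helper names `linear_score` and `preferred` are all assumptions made for illustration only.

```python
from typing import Dict


def linear_score(answers: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Sum weighted yes/no judgments into a single score.

    Assumption: "yes" is encoded as +1 and "no" as -1; the actual
    encoding used by VisionReward may differ.
    """
    return sum(w * (1.0 if answers[q] else -1.0) for q, w in weights.items())


def preferred(a: Dict[str, bool], b: Dict[str, bool],
              weights: Dict[str, float]) -> str:
    """Return which of two candidates scores higher ("a", "b", or "tie")."""
    score_a, score_b = linear_score(a, weights), linear_score(b, weights)
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"


# Hypothetical judgment questions and weights (not taken from the paper).
WEIGHTS = {
    "subject_is_sharp": 0.5,
    "lighting_is_natural": 0.25,
    "motion_is_smooth": 0.25,
}

candidate_a = {"subject_is_sharp": True, "lighting_is_natural": True,
               "motion_is_smooth": False}
candidate_b = {"subject_is_sharp": False, "lighting_is_natural": True,
               "motion_is_smooth": True}

score_a = linear_score(candidate_a, WEIGHTS)
score_b = linear_score(candidate_b, WEIGHTS)
choice = preferred(candidate_a, candidate_b, WEIGHTS)
```

Because each question contributes its own weighted term, the final score can be traced back to individual judgments, which is the sense in which such a score is interpretable.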
Problem

Research questions and friction points this paper is trying to address.

Image Preferences
Visual Content Generation
Human Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

VisionReward
Visual Preference Evaluation
Enhanced Accuracy Learning Algorithm