Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a general, reliable reward model for multimodal understanding and reasoning tasks. We propose the first unified reward model that jointly covers both multimodal understanding and reasoning capabilities. Methodologically, we build a reward head upon Qwen2.5-VL-7B-Instruct and train it with multi-stage supervised fine-tuning and a pairwise ranking loss on large-scale cross-modal preference data. Our key contributions are: (1) the first unified reward model for multimodal understanding and reasoning; (2) preference data constructed with our model proves highly effective for Mixed Preference Optimization (MPO), substantially improving multimodal reasoning; and (3) our model achieves state-of-the-art results on VL-RewardBench and strong performance on the text-only RewardBench benchmark. The model is publicly released.

📝 Abstract
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning with a pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on the multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed with Skywork-VL Reward proves highly effective for Mixed Preference Optimization (MPO) training, leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
Problem

Research questions and friction points this paper is trying to address.

No general, reliable reward model exists for multimodal understanding and reasoning tasks
Existing multimodal preference data lacks the scale and task diversity needed to train such a model
Multimodal alignment requires stronger reward architectures and fine-tuning strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal preference dataset construction
Qwen2.5-VL-7B-Instruct based reward model architecture
Multi-stage fine-tuning with pairwise ranking loss
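The pairwise ranking loss named above is, in reward-model training, typically the Bradley-Terry objective: the model assigns a scalar score to each response, and the loss penalizes cases where the rejected response scores at or above the chosen one. The following is a minimal numeric sketch of that objective; the exact margins, batching, and pooling used by Skywork-VL Reward are not specified here, so this illustrates only the standard form of the loss.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward of the preferred (chosen) response
    above that of the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)

# Equal scores give ln 2; a large positive margin drives the loss toward 0.
print(round(pairwise_ranking_loss(1.0, 1.0), 4))   # 0.6931
print(round(pairwise_ranking_loss(5.0, 0.0), 4))   # 0.0067
```

In practice the scalar scores come from a linear reward head applied to the backbone's pooled final hidden state, and the loss is averaged over a batch of (chosen, rejected) preference pairs.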
Authors
Xiaokun Wang (Nanjing University)
Chris (Skywork AI, Kunlun Inc.)
Jiangbo Pei (Skywork AI, Kunlun Inc.)
Wei Shen (Skywork AI, Kunlun Inc.)
Yi Peng (Bytedance)
Yunzhuo Hao (CS PhD Student @ Zhejiang University)
Weijie Qiu (Skywork AI, Kunlun Inc.)
Ai Jian (Skywork AI, Kunlun Inc.)
Tianyidan Xie (Skywork AI, Kunlun Inc.)
Xuchen Song (CTO @ Mureka.ai | Head of Multimodality & Spatial AI @ Skywork.ai)
Yang Liu (Skywork AI, Kunlun Inc.)
Yahui Zhou (Skywork AI, Kunlun Inc.)