🤖 AI Summary
This study investigates whether vision-language models (VLMs) can effectively model human visual preferences. To address this, we propose a test-time preference reasoning paradigm that combines reinforcement learning, inspired by DeepSeek R1 and OpenAI O1, with the VLM's intrinsic reasoning abilities and a soft-reward strategy, enabling interpretable image ranking that generalizes across diverse image resolutions and complexities. Our method is fine-tuned on the ImageReward and HPSv2 preference datasets and outperforms conventional static scoring and pairwise classification approaches. Experiments demonstrate ranking accuracies of 64.9% and 65.4% on ImageReward and HPSv2, respectively, matching the performance of dedicated preference encoders. To our knowledge, this is the first work to elicit explicit, end-to-end preference reasoning from VLMs at test time.
📝 Abstract
Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to reason about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach leverages not only the VLM's rich world knowledge but also its capacity to reason, yielding interpretable outcomes that support decision-making. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking that outperform simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive annotation while improving reward generalization and explainability, our findings mark a strong milestone toward further enhancing text-to-vision models.
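The abstract does not spell out the soft-reward formulation. One plausible instantiation, sketched below purely as an assumption, is a pairwise-concordance reward (a normalized Kendall-style agreement) that grants partial credit when a model's predicted ranking gets some, but not all, image pairs in the right order, instead of an all-or-nothing exact-match reward. The function name `soft_rank_reward` and the list-of-ids representation are illustrative, not from the paper.

```python
from itertools import combinations

def soft_rank_reward(predicted, reference):
    """Hypothetical soft reward for image ranking.

    `predicted` and `reference` are lists of image ids ordered from
    most to least preferred. Returns the fraction of image pairs whose
    relative order in `predicted` agrees with `reference`, so a ranking
    that is mostly correct still earns a high (but partial) reward.
    """
    # Position of each image in the model's predicted ranking.
    pos = {img: i for i, img in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    # A pair (a, b) is concordant if `a` (preferred in the reference)
    # is also ranked ahead of `b` in the prediction.
    concordant = sum(1 for a, b in pairs if pos[a] < pos[b])
    return concordant / len(pairs)
```

For example, `soft_rank_reward([1, 3, 2], [1, 2, 3])` yields 2/3: two of the three pairs are ordered correctly, so the policy still receives a graded learning signal rather than a zero.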