Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image quality assessment (IQA) suffers from a scarcity of high-quality human annotations, which hinders unsupervised performance improvement of vision-language models (VLMs). Method: This paper proposes EvoQuality, a framework that brings the principle of self-consistency to IQA. It aggregates a VLM's repeated pairwise ranking outputs over image pairs via majority voting to generate high-confidence pseudo-labels, then formulates these pseudo-rankings into a fidelity reward that drives iterative model evolution through Group Relative Policy Optimization (GRPO). Contribution/Results: EvoQuality operates entirely without ground-truth labels. In zero-shot evaluation across seven mainstream IQA benchmarks, it improves the base VLM's Pearson linear correlation coefficient (PLCC) by 31.8% on average and surpasses state-of-the-art supervised VLM-based approaches on five of the seven benchmarks.

📝 Abstract
Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
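The pairwise majority-voting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_vlm` is a hypothetical stand-in for a stochastic VLM preference query, and the confidence threshold policy is an assumption.

```python
import random
from collections import Counter

def majority_vote_pseudo_label(vlm_compare, img_a, img_b, n_samples=16):
    """Query the model n_samples times on the same image pair and keep the
    majority preference as a pseudo-label ('A' or 'B'), together with its
    vote share as a confidence score. In a self-evolution loop, only
    high-confidence pairs would be kept as training signal."""
    votes = Counter(vlm_compare(img_a, img_b) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

# Hypothetical stochastic comparator: prefers image A about 70% of the time.
def toy_vlm(img_a, img_b):
    return "A" if random.random() < 0.7 else "B"

random.seed(0)
label, conf = majority_vote_pseudo_label(toy_vlm, "img_a.png", "img_b.png")
print(label, conf)
```

Repeating the query and voting filters out the model's sampling noise, so the consensus label is more reliable than any single output.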
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision-language models for image quality assessment
Developing self-supervised methods without human-annotated data
Refining perceptual capabilities through iterative self-evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving VLM via voting and ranking
Generates pseudo-labels through majority voting
Uses GRPO for iterative model evolution
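The reward and policy-optimization steps above can be illustrated with a short sketch. The fidelity form follows the classic fidelity measure from ranking losses, and the group-relative normalization follows the standard GRPO recipe; the exact reward shaping used by EvoQuality may differ, so treat this as an assumption-labeled toy.

```python
import math
from statistics import mean, pstdev

def fidelity_reward(p_pred, pseudo_label):
    """Fidelity between the model's predicted preference probability and the
    voted pseudo-label (1 if the first image is preferred, else 0).
    Illustrative form, not the paper's exact reward."""
    y = float(pseudo_label)
    return math.sqrt(p_pred * y) + math.sqrt((1.0 - p_pred) * (1.0 - y))

def group_relative_advantages(rewards):
    """GRPO-style normalization: each sampled response's advantage is its
    reward standardized against the group sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four sampled responses to one image pair whose pseudo-label is 1 (A preferred):
rewards = [fidelity_reward(p, 1) for p in (0.9, 0.6, 0.3, 0.8)]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])
```

Responses that agree with the voted consensus earn above-average reward and a positive advantage, which is what pushes the policy toward the consensus ranking on the next iteration.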
Wen Wen
City University of Hong Kong
Tianwu Zhi
ByteDance Inc.
Kanglong Fan
City University of Hong Kong
Yang Li
ByteDance Inc.
Xinge Peng
ByteDance Inc.
Yabin Zhang
ByteDance Inc.
Yiting Liao
Staff Research Scientist at Wireless Communications Research, Intel Labs
Video Processing · Video Communications · Video Understanding
Junlin Li
ByteDance Inc. - Georgia Institute of Technology - Tsinghua University
Video Compression and Processing · Video Streaming · Machine Learning · AI · ASIC Design
Li Zhang
ByteDance Inc.