Modeling Image-Caption Rating from Comparative Judgments

📅 2026-01-30
🤖 AI Summary
This work proposes a comparative learning approach based on human relative preference judgments to address the high cost and subjectivity of manual scoring in image-caption quality assessment. Instead of regressing on direct ratings, the method learns from pairwise comparisons, which reduces annotation cost and improves inter-annotator consistency. The model extracts visual features with ResNet-50 and textual features with MiniLM within a paired-comparison framework. Experiments on the VICR dataset show that the comparative model improves steadily with more data and approaches the regression baseline, which achieves a Pearson correlation coefficient of 0.7609. A small-scale human evaluation further confirms that comparative annotation is faster and yields stronger agreement among annotators than traditional absolute scoring.

📝 Abstract
Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson's $\rho$: 0.7609 and Spearman's $r_s$: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.
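The core idea of the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a linear scoring head over concatenated image and caption embeddings, trained with a Bradley-Terry style pairwise logistic loss, P(A preferred over B) = sigmoid(score(A) - score(B)). The embedding dimension, the synthetic data, and the plain gradient-descent loop are all illustrative assumptions; in the paper the features would come from ResNet-50 and MiniLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embedding size; the paper's ResNet-50 / MiniLM features
# would be 2048-d and 384-d respectively.
D = 8

def pair_features(img_emb, txt_emb):
    """Concatenate image and caption embeddings into one feature vector."""
    return np.concatenate([img_emb, txt_emb])

def score(w, x):
    """Linear scoring head: higher means a better image-caption match."""
    return x @ w

def pairwise_loss_grad(w, x_pref, x_other):
    """Bradley-Terry pairwise loss for one comparative judgment:
    -log sigmoid(score(pref) - score(other)). Returns (loss, gradient)."""
    margin = score(w, x_pref) - score(w, x_other)
    p = 1.0 / (1.0 + np.exp(-margin))
    loss = -np.log(p + 1e-12)
    grad = -(1.0 - p) * (x_pref - x_other)
    return loss, grad

# Synthetic comparative judgments: a hidden "true" quality direction
# stands in for the human annotator's preference.
w_true = rng.normal(size=2 * D)
pairs = []
for _ in range(400):
    a = pair_features(rng.normal(size=D), rng.normal(size=D))
    b = pair_features(rng.normal(size=D), rng.normal(size=D))
    pairs.append((a, b) if a @ w_true > b @ w_true else (b, a))

# Train the scorer on comparisons with plain full-batch gradient descent.
w = np.zeros(2 * D)
for _ in range(200):
    g = np.zeros_like(w)
    for x_pref, x_other in pairs:
        _, grad = pairwise_loss_grad(w, x_pref, x_other)
        g += grad
    w -= 0.01 * g / len(pairs)

# The learned scorer assigns an absolute score to any image-caption pair,
# so it can rank unseen pairs the way a regression model trained on
# direct ratings would.
acc = np.mean([score(w, a) > score(w, b) for a, b in pairs])
```

Note that although training only ever sees relative judgments, the learned `score` function is absolute, which is what lets the comparative model be evaluated against a rating-regression baseline with rank correlations such as Pearson's ρ and Spearman's r_s.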
Problem

Research questions and friction points this paper is trying to address.

image-caption rating
comparative judgments
human annotation
caption evaluation
preference modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

comparative learning
image-caption rating
pairwise comparison
human annotation efficiency
preference modeling
Kezia Minni
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY, USA
Qiang Zhang
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY, USA
Monoshiz Mahbub Khan
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY, USA
Zhe Yu
Software Engineering, Rochester Institute of Technology
Software Engineering · Machine Learning · Data Mining · Information Retrieval · Human-Centered Computing