🤖 AI Summary
This study investigates the capability of large multimodal models (e.g., GPT-4o) to perceive visual interestingness and their alignment with human judgments. Method: We propose a GPT-4o-based pairwise interest ranking framework that leverages the model's multimodal understanding to automatically annotate large-scale preference data, followed by knowledge distillation into a learning-to-rank model. Contribution/Results: We provide the first empirical evidence that GPT-4o exhibits significant, though incomplete, alignment with human assessments of visual interestingness, confirming that it implicitly captures the concept of interest. Our method outperforms existing visual interestingness prediction models across multiple benchmarks. Moreover, it establishes the first scalable, low-cost paradigm for generating high-quality interest annotations, enabling new avenues for human-AI collaborative interest modeling and cross-modal perceptual alignment.
📝 Abstract
Our daily life is highly influenced by what we consume and see. Attracting and holding one's attention -- the definition of (visual) interestingness -- is essential. Large Multimodal Models (LMMs) trained on large-scale visual and textual data have demonstrated impressive capabilities. We explore to what extent these models capture the concept of visual interestingness, and examine the alignment between human assessments and the predictions of GPT-4o, a leading LMM, through comparative analysis. Our studies reveal partial alignment between humans and GPT-4o: it already captures the concept better than state-of-the-art methods. This allows image pairs to be effectively labeled according to their relative interestingness, and these labels serve as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.
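To make the two-stage pipeline concrete, below is a minimal sketch in Python: query GPT-4o for a pairwise interestingness preference, then distill the resulting labels into a RankNet-style learning-to-rank scorer. The prompt wording, the use of precomputed embeddings, the `InterestScorer` architecture, and all hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: (1) GPT-4o pairwise interestingness labeling,
# (2) distillation into a pairwise learning-to-rank model.
import torch
import torch.nn as nn
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def gpt4o_prefers_first(url_a: str, url_b: str) -> bool:
    """Ask GPT-4o which of two images is more interesting.

    Prompt wording is an assumption; the paper's exact prompt may differ.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which image is more interesting, i.e. better at "
                         "attracting and holding attention? Answer 'first' "
                         "or 'second' only."},
                {"type": "image_url", "image_url": {"url": url_a}},
                {"type": "image_url", "image_url": {"url": url_b}},
            ],
        }],
    )
    return "first" in response.choices[0].message.content.lower()


class InterestScorer(nn.Module):
    """Scores a precomputed image embedding (e.g., from a CLIP encoder).

    A small MLP is an assumption; any scoring backbone would fit here.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)


def ranknet_loss(score_pref: torch.Tensor,
                 score_other: torch.Tensor) -> torch.Tensor:
    # RankNet pairwise loss: push the GPT-4o-preferred image's
    # score above the other image's score.
    return -torch.nn.functional.logsigmoid(score_pref - score_other).mean()


# One training step over GPT-4o-labeled pairs; random tensors stand in
# for real image embeddings.
scorer = InterestScorer()
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)
emb_pref, emb_other = torch.randn(32, 512), torch.randn(32, 512)
loss = ranknet_loss(scorer(emb_pref), scorer(emb_other))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Once distilled, the lightweight scorer ranks unseen images without further GPT-4o calls, which is what makes the annotation paradigm low-cost at scale.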