Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual perception tasks—such as affective analysis, quality assessment, and memorability prediction—rely heavily on subjective human annotations, leading to data scarcity, poor cross-dataset generalization, and task-specific modeling. To address these limitations, we propose the first unified, lightweight adaptation framework grounded in CLIP’s pretrained multimodal priors. We systematically demonstrate that CLIP’s image–text alignment objective implicitly encodes human judgment tendencies; thus, the sentiment and perceptual semantics embedded in its generated captions serve as effective universal priors. Our framework introduces a minimal adapter—comprising only linear projections and a small number of trainable parameters—that enables joint zero-shot transfer and supervised fine-tuning. Evaluated on image memorability prediction, no-reference image quality assessment, and visual affective analysis, our method achieves state-of-the-art performance across all three tasks while significantly improving cross-dataset generalization.
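
To make the "CLIP as a perceptual prior" idea concrete, here is a small zero-shot probe in the spirit of antonym-prompt quality scoring (as popularized by CLIP-IQA). This is a hedged illustration of the general mechanism, not the paper's pipeline; the checkpoint name, prompt pair, and image path are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative zero-shot quality probe (not the paper's exact prompts/setup).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")               # any test image (assumed path)
prompts = ["a good photo.", "a bad photo."]     # assumed antonym prompt pair

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity per prompt; a softmax over the
# antonym pair yields a crude perceptual quality score in [0, 1].
quality = outputs.logits_per_image.softmax(dim=-1)[0, 0].item()
print(f"zero-shot quality estimate: {quality:.3f}")
```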

📝 Abstract
Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets, leading to poor generalization. Typically, specialized models have been designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks, leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.
Problem

Research questions and friction points this paper is trying to address.

Perceptual labels reflect subjective human judgments, making annotation costly and the resulting datasets small.
Small, task-specific datasets yield models that generalize poorly across datasets.
Each perceptual task has traditionally required its own specialized architecture and training set.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single unified framework that adapts CLIP to multiple perceptual tasks without any task-specific architectural changes (see the sketch below).
Lightweight adaptation: only a small trainable head is fine-tuned on top of CLIP's features for each task.
CLIP's human-written training captions carry sentiments and emotions alongside factual descriptions, making it a strong prior for human judgment.
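
As a rough illustration of what such a lightweight adaptation could look like, here is a minimal PyTorch sketch: a frozen CLIP image encoder with a small trainable MLP head that regresses a scalar perceptual score (memorability, quality, or emotion intensity). The class name PerceptualAdapter, the head width, and the training hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class PerceptualAdapter(nn.Module):
    """Hypothetical lightweight adapter: frozen CLIP image encoder plus a
    small trainable head regressing a scalar perceptual score."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32", hidden=128):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():
            p.requires_grad = False        # keep CLIP's pretrained prior intact
        embed_dim = self.clip.config.projection_dim  # 512 for ViT-B/32
        self.head = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pixel_values):
        feats = self.clip.get_image_features(pixel_values=pixel_values)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
        return self.head(feats).squeeze(-1)               # one score per image

# Typical supervised fine-tuning step against human ratings (MSE loss):
model = PerceptualAdapter()
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
pixel_values = torch.randn(4, 3, 224, 224)   # stand-in batch of images
targets = torch.rand(4)                      # stand-in human scores in [0, 1]
loss = nn.functional.mse_loss(model(pixel_values), targets)
loss.backward()
optimizer.step()
```

Freezing the backbone and training only the head keeps the number of trainable parameters small and preserves the CLIP prior the paper argues is doing most of the work.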
👥 Authors
Amit Zalcher (Weizmann Institute of Science)
Navve Wasserman (affiliation not listed)
Roman Beliy (Weizmann Institute of Science)
Oliver Heinimann (Weizmann Institute of Science)
Michal Irani (Professor of Computer Science, Weizmann Institute; Computer Vision, Image Processing, Video Information Analysis)