CLIP Brings Better Features to Visual Aesthetics Learners

📅 2023-07-28
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
📄 PDF
🤖 AI Summary
Image Aesthetic Assessment (IAA) faces challenges including severe label scarcity, high subjectivity, and the incompatibility of large CLIP-based models with lightweight deployment scenarios. To address these, we propose CSKD, a two-stage CLIP-driven semi-supervised knowledge distillation framework. In Stage I, cross-encoder feature alignment mitigates feature collapse during CLIP fine-tuning. In Stage II, multi-source unlabeled data are leveraged via consistency regularization and attention-distance constraints for collaborative optimization. Our method introduces the first unified, scalable CLIP feature transfer paradigm, enabling plug-and-play integration of arbitrary vision encoders. Evaluated on multiple mainstream IAA benchmarks, CSKD achieves state-of-the-art performance: student models attain significant accuracy gains while reducing inference latency, demonstrating strong efficacy in resource-constrained and low-label regimes.
📝 Abstract
The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is an ideal application scenario for such methods due to its subjective and expensive labeling procedure. In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm, namely CSKD, is proposed. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetics learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we observe an alleviation of the feature collapse issue, which in turn showcases the necessity of feature alignment instead of training directly on the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.
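The phase-1 alignment described above can be sketched as follows. This is a minimal, hedged illustration, not the paper's exact objective: the projection matrix, the cosine-distance form of the loss, and all dimensions here are assumptions chosen for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def feature_alignment_loss(student_feats, clip_feats, proj):
    """Hypothetical Stage-I alignment loss between a student visual encoder
    and a frozen CLIP image encoder.

    student_feats: (B, d_s) features from the given visual encoder
    clip_feats:    (B, d_c) features from the off-the-shelf CLIP encoder
    proj:          (d_s, d_c) learnable projection bridging the dimensions
    """
    projected = student_feats @ proj              # (B, d_c)
    s = l2_normalize(projected)
    t = l2_normalize(clip_feats)
    # 1 - cosine similarity, averaged over the batch
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Toy batch: the loss is bounded in [0, 2] and vanishes when the
# projected student features coincide with the CLIP features.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 256))
clip = rng.normal(size=(4, 512))
proj = rng.normal(size=(256, 512)) * 0.01
loss = feature_alignment_loss(student, clip, proj)
aligned = feature_alignment_loss(clip, clip, np.eye(512))  # identical features
```

In practice such a loss would be minimized over the unlabeled multi-source images while the CLIP encoder stays frozen; any encoder that exposes a feature vector can be plugged in, which matches the paper's claim that the visual encoder is not limited by size or structure.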
Problem

Research questions and friction points this paper is trying to address.

Leveraging CLIP for Image Aesthetics Assessment tasks
Overcoming limited data and resource constraints in IAA
Distilling CLIP knowledge to lightweight IAA models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase CLIP-based Semi-supervised Knowledge Distillation
Feature alignment for heterogeneous model distillation
Collaborative distillation with unlabeled examples
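The consistency-regularization idea behind the collaborative distillation on unlabeled examples can be illustrated with a standard confidence-thresholded pseudo-labeling loss. This is a generic sketch, not the paper's exact Stage-II objective (which also involves attention-distance constraints); the threshold value, the two-view setup, and the toy logits are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_weak, logits_strong, threshold=0.8):
    """Cross-entropy between confident pseudo-labels from a weakly augmented
    view and the model's predictions on a strongly augmented view.
    Low-confidence unlabeled samples are masked out."""
    probs_weak = softmax(logits_weak)
    pseudo = probs_weak.argmax(axis=1)            # hard pseudo-labels
    conf = probs_weak.max(axis=1)
    mask = conf >= threshold                      # keep confident samples only
    log_probs_strong = np.log(softmax(logits_strong) + 1e-12)
    ce = -log_probs_strong[np.arange(len(pseudo)), pseudo]
    return float((ce * mask).sum() / max(mask.sum(), 1))

# Toy unlabeled batch of two images over three aesthetic bins:
# only the first sample is confident enough to contribute.
weak = np.array([[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
strong = np.array([[3.0, 0.5, 0.5], [0.0, 0.1, 0.0]])
loss = consistency_loss(weak, strong)
```

Under this scheme the unlabeled multi-source data supplies a training signal without any aesthetic labels, which is what makes it usable in the low-label regimes the paper targets.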
🔎 Similar Papers
2024-08-27 · International Conference on Pattern Recognition · Citations: 5
2023-11-28 · European Conference on Computer Vision · Citations: 4
Liwu Xu
OPPO Research Institute
Jinjin Xu
OPPO Research Institute
Yuzhe Yang
OPPO Research Institute
Yi-Jie Huang
OPPO Research Institute
Yanchun Xie
OPPO Research Institute
Yaqian Li
Li Auto
computer vision