Evolution of Low-Level and Texture Human-CLIP Alignment

📅 2025-08-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the non-monotonic “increase-then-decrease” alignment between CLIP’s internal representations and human low-level image quality perception during training. We jointly analyze the evolution of shape–texture bias and classification stability under adversarial perturbations. Results show that early training prioritizes texture-based features—enhancing correlation with human perceptual judgments—while later stages shift toward abstract, shape-invariant representations to improve robustness, thereby degrading perceptual alignment. We thus identify an intrinsic trade-off between low-level perceptual alignment and high-level task robustness—a novel “alignment–robustness trade-off” perspective. Extensive experiments across multiple benchmarks confirm the generality of this mechanism. Our findings provide both theoretical foundations and practical guidance for enhancing interpretability and enabling alignment-controllable training in multimodal foundation models.

📝 Abstract
During the training of multi-modal models such as CLIP, we observed an intriguing phenomenon: the correlation with low-level human image quality assessments peaks in the early epochs and then gradually declines. This study investigates this observation and seeks to explain it through two key factors: shape-texture bias alignment and the drop in classification accuracy under noise. Our findings suggest that CLIP initially learns low-level visual features, enhancing its alignment with low-level human perception but also increasing its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract, shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors share an underlying learning mechanism and provide new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.
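The correlation tracked in the abstract is, in image quality assessment, conventionally the Spearman rank correlation (SRCC) between model quality scores and human mean opinion scores (MOS). A minimal, dependency-free sketch of that measurement across checkpoints — the checkpoint names and all numbers below are illustrative toy values, not results from the paper:

```python
def rankdata(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # group tied values together
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Toy illustration of the reported trend: SRCC rises early, then falls.
mos = [1.2, 3.4, 4.1, 2.0, 4.8]  # hypothetical human MOS per image
checkpoint_scores = {
    "epoch_1":  [0.9, 3.0, 4.5, 2.2, 4.0],
    "epoch_5":  [1.0, 3.5, 4.2, 1.9, 4.9],  # alignment peaks here
    "epoch_50": [3.1, 1.4, 2.0, 4.2, 2.5],  # alignment degraded
}
for name, scores in checkpoint_scores.items():
    print(name, round(spearman(mos, scores), 3))
```

Plotting this per-epoch SRCC curve is what would reveal the "increase-then-decrease" pattern the paper describes.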
Problem

Research questions and friction points this paper is trying to address.

Investigating early peak in low-level human-CLIP quality correlation
Understanding causes through shape-texture bias and noise sensitivity
Optimizing trade-off between perceptual alignment and robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early training enhances low-level human perception alignment
Progressive shift to abstract shape-based representations improves robustness
Optimizing trade-off between perceptual alignment and noise robustness
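The shape-texture bias mentioned above is commonly measured on cue-conflict stimuli, i.e. images whose shape and texture suggest different classes. A minimal sketch of that standard metric with hypothetical inputs — the paper does not spell out its exact protocol, so this is an assumption about the measurement, not the authors' code:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions that follow shape over texture.

    Only trials where the model picked either the shape class or the
    texture class count; predictions matching neither are ignored.
    """
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Toy example: a cat-shaped image with dog texture, etc.
preds = ["cat", "dog", "cat"]
shapes = ["cat", "cat", "cat"]
textures = ["dog", "dog", "dog"]
print(shape_bias(preds, shapes, textures))  # 2 of 3 decisions follow shape
```

Under the paper's account, this value would start low (texture-biased) early in training and rise as the model shifts toward shape-based representations.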