AI Summary
This work investigates the non-monotonic "increase-then-decrease" alignment between CLIP's internal representations and human low-level image quality perception during training. We jointly analyze the evolution of shape-texture bias and classification stability under adversarial perturbations. Results show that early training prioritizes texture-based features, enhancing correlation with human perceptual judgments, while later stages shift toward abstract, shape-invariant representations that improve robustness at the cost of perceptual alignment. We thus identify an intrinsic trade-off between low-level perceptual alignment and high-level task robustness, a novel "alignment-robustness trade-off" perspective. Extensive experiments across multiple benchmarks confirm the generality of this mechanism. Our findings provide both theoretical foundations and practical guidance for enhancing interpretability and enabling alignment-controllable training in multimodal foundation models.
Abstract
During the training of multimodal models such as CLIP, we observed an intriguing phenomenon: the model's correlation with low-level human image quality assessments peaks in the early epochs before gradually declining. This study investigates this observation and seeks to explain it through two key factors: shape-texture bias alignment and the drop in classification accuracy under noise. Our findings suggest that CLIP initially learns low-level visual features, which enhances its alignment with low-level human perception but also increases its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract, shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors share an underlying learning mechanism, and they offer new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.
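The alignment tracked across epochs is typically quantified as a rank correlation between model-derived quality scores and human mean opinion scores (MOS). The minimal sketch below shows one plausible way to monitor this per epoch: a pure-Python Spearman correlation applied to synthetic scores. The specific score values, and the idea that the model score comes from CLIP-based quality prediction, are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Hedged sketch: tracking rank correlation between model quality scores
# and human MOS across training epochs. All numbers below are synthetic.

def rank(values):
    """Return 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Synthetic illustration of the "increase-then-decrease" pattern:
# human MOS vs. hypothetical model scores at an early and a late epoch.
mos         = [4.5, 3.2, 2.1, 4.8, 1.5]
early_epoch = [0.82, 0.60, 0.41, 0.90, 0.30]  # ranks match MOS exactly
late_epoch  = [0.70, 0.75, 0.50, 0.65, 0.55]  # rank agreement degraded

print(spearman(mos, early_epoch))  # 1.0
print(spearman(mos, late_epoch))   # 0.5
```

Logging this correlation once per epoch on a held-out IQA benchmark is enough to reproduce the peak-then-decline curve the abstract describes; in practice one would use `scipy.stats.spearmanr` rather than the hand-rolled version here.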