🤖 AI Summary
This work investigates embedding shifts in vision-language models (e.g., CLIP) under image data augmentation, uncovering their profound impact on representation interpretability and robustness. We systematically evaluate nine common augmentations—quantifying their perturbations on attention maps, image patches, and edge/detail fidelity—and introduce the first mechanistic interpretability–informed model of embedding shift dynamics. Noise, perspective transformation, and shift-scale-rotate are identified as high-perturbation augmentations. A multi-granularity evaluation framework is proposed, incorporating cosine similarity, L2 distance, pairwise distance, hierarchical clustering (dendrograms), and qualitative visualizations. We release the first reproducible benchmark for VLM representation stability, empirically demonstrating an inherent trade-off between augmentation robustness and representation interpretability. This work provides both theoretical foundations and practical tools for adversarial robustness design and interpretable AI.
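The multi-granularity evaluation described above (pairwise distances over augmented embeddings, followed by hierarchical clustering and dendrograms) can be sketched as follows. The embeddings here are synthetic stand-ins: in the actual study they would come from CLIP's image encoder, and the augmentation names and noise scales are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Hypothetical embeddings of one image under several augmentations.
# In the real pipeline these rows would be CLIP image-encoder outputs;
# here each augmentation perturbs a shared base vector by a chosen scale.
names = ["original", "noise", "blur", "hflip", "perspective"]
scales = [0.0, 0.8, 0.2, 0.1, 0.9]  # assumed perturbation strengths
base = rng.standard_normal(64)
emb = np.stack([base + s * rng.standard_normal(64) for s in scales])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

# Pairwise cosine distances, then average-linkage hierarchical clustering.
# scipy.cluster.hierarchy.dendrogram(Z) would plot the resulting tree.
dists = pdist(emb, metric="cosine")
Z = linkage(dists, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
for name, label in zip(names, labels):
    print(f"{name}: cluster {label}")
```

Cutting the tree into two clusters tends to separate the high-perturbation augmentations (large distance to the original) from the mild ones, which is the grouping the dendrogram analysis is meant to surface.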
📝 Abstract
Understanding representation shift in Vision-Language Models (VLMs) such as CLIP under different augmentations provides valuable insights into mechanistic interpretability. In this study, we characterize the shift in CLIP's embeddings under nine common augmentation techniques: noise, blur, color jitter, shift-scale-rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts through attention-map similarity, patch similarity, edge and detail preservation, cosine similarity, L2 distance, pairwise distance, and dendrogram clustering, and provide qualitative analysis on sample images. Our findings suggest that certain augmentations, such as noise, perspective transform, and shift-scale-rotate, have a markedly stronger impact on embedding shift. This study provides a concrete foundation for future work on VLM robustness for mechanistic interpretability and adversarial data defense.
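The core measurement in the abstract, comparing an image's embedding before and after augmentation via cosine similarity and L2 distance, can be sketched as below. A frozen random projection stands in for CLIP's image encoder (the study uses the real model), and the two toy augmentations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder: a fixed random projection playing the role of
# CLIP's image encoder, purely for illustration.
W = rng.standard_normal((64, 3 * 32 * 32))

def embed(img: np.ndarray) -> np.ndarray:
    """Flatten a (3, 32, 32) image and project to an L2-normalized embedding."""
    z = W @ img.reshape(-1)
    return z / np.linalg.norm(z)

def embedding_shift(img: np.ndarray, augment) -> dict:
    """Cosine similarity and L2 distance between original and augmented embeddings."""
    z0, z1 = embed(img), embed(augment(img))
    return {"cosine": float(z0 @ z1), "l2": float(np.linalg.norm(z0 - z1))}

img = rng.random((3, 32, 32))
augmentations = {
    "noise": lambda x: np.clip(x + 0.3 * rng.standard_normal(x.shape), 0, 1),
    "hflip": lambda x: x[..., ::-1].copy(),
}
for name, fn in augmentations.items():
    m = embedding_shift(img, fn)
    print(f"{name}: cosine={m['cosine']:.3f}  l2={m['l2']:.3f}")
```

Since the embeddings are unit-normalized, the two metrics are monotonically related (l2² = 2 − 2·cosine); the study additionally reports pairwise distances and dendrogram clusters across many images and augmentations.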