🤖 AI Summary
This work investigates embedding shifts in vision-language models (e.g., CLIP) under image data augmentation, uncovering their profound impact on representation interpretability and robustness. We systematically evaluate nine common augmentations—quantifying their perturbations on attention maps, image patches, and edge/detail fidelity—and introduce the first mechanistic interpretability–informed model of embedding shift dynamics. Noise, perspective transformation, and shift-scale-rotate are identified as high-perturbation augmentations. A multi-granularity evaluation framework is proposed, incorporating cosine similarity, L2 distance, pairwise distance, hierarchical clustering (dendrograms), and qualitative visualizations. We release the first reproducible benchmark for VLM representation stability, empirically demonstrating an inherent trade-off between augmentation robustness and representation interpretability. This work provides both theoretical foundations and practical tools for adversarial robustness design and interpretable AI.
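The multi-granularity evaluation described above (pairwise distances over augmented embeddings, followed by hierarchical clustering and dendrograms) can be sketched as follows. The embeddings here are synthetic stand-ins: in the actual study they would come from CLIP's image encoder, and the augmentation names and noise scales are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Hypothetical embeddings of one image under several augmentations.
# In the real pipeline these rows would be CLIP image-encoder outputs;
# here each augmentation perturbs a shared base vector by a chosen scale.
names = ["original", "noise", "blur", "hflip", "perspective"]
scales = [0.0, 0.8, 0.2, 0.1, 0.9]  # assumed perturbation strengths
base = rng.standard_normal(64)
emb = np.stack([base + s * rng.standard_normal(64) for s in scales])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

# Pairwise cosine distances, then average-linkage hierarchical clustering.
# scipy.cluster.hierarchy.dendrogram(Z) would plot the resulting tree.
dists = pdist(emb, metric="cosine")
Z = linkage(dists, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
for name, label in zip(names, labels):
    print(f"{name}: cluster {label}")
```

Cutting the tree into two clusters tends to separate the high-perturbation augmentations (large distance to the original) from the mild ones, which is the grouping the dendrogram analysis is meant to surface.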
📝 Abstract
Understanding representation shift in Vision-Language Models (VLMs) such as CLIP under different augmentations provides valuable insights into mechanistic interpretability. In this study, we characterize the shift in CLIP's embeddings under nine common augmentation techniques: noise, blur, color jitter, shift-scale-rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts through attention-map similarity, patch similarity, edge and detail preservation, cosine similarity, L2 distance, pairwise distance, and dendrogram clustering, and provide qualitative analysis on sample images. Our findings suggest that certain augmentations, such as noise, perspective transform, and shift-scale-rotate, have a markedly stronger impact on embedding shift. This study provides a concrete foundation for future work on VLM robustness for mechanistic interpretability and adversarial data defense.
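The core measurement in the abstract, comparing an image's embedding before and after augmentation via cosine similarity and L2 distance, can be sketched as below. A frozen random projection stands in for CLIP's image encoder (the study uses the real model), and the two toy augmentations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder: a fixed random projection playing the role of
# CLIP's image encoder, purely for illustration.
W = rng.standard_normal((64, 3 * 32 * 32))

def embed(img: np.ndarray) -> np.ndarray:
    """Flatten a (3, 32, 32) image and project to an L2-normalized embedding."""
    z = W @ img.reshape(-1)
    return z / np.linalg.norm(z)

def embedding_shift(img: np.ndarray, augment) -> dict:
    """Cosine similarity and L2 distance between original and augmented embeddings."""
    z0, z1 = embed(img), embed(augment(img))
    return {"cosine": float(z0 @ z1), "l2": float(np.linalg.norm(z0 - z1))}

img = rng.random((3, 32, 32))
augmentations = {
    "noise": lambda x: np.clip(x + 0.3 * rng.standard_normal(x.shape), 0, 1),
    "hflip": lambda x: x[..., ::-1].copy(),
}
for name, fn in augmentations.items():
    m = embedding_shift(img, fn)
    print(f"{name}: cosine={m['cosine']:.3f}  l2={m['l2']:.3f}")
```

Since the embeddings are unit-normalized, the two metrics are monotonically related (l2² = 2 − 2·cosine); the study additionally reports pairwise distances and dendrogram clusters across many images and augmentations.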