🤖 AI Summary
This study addresses the instability of vision foundation model embeddings under common image perturbations, such as JPEG compression and brightness or contrast adjustments, which can degrade performance on downstream tasks. The authors present the first systematic evaluation of this kind, covering six industry-scale models from OpenAI and Meta across nine common perturbation categories. They propose three robustness metrics, formulate five desired mathematical properties for such metrics, and analyze which properties each metric satisfies or violates. They also show that robustness values can predict the impact of perturbations on downstream performance (e.g., classification accuracy). Finally, they propose a fine-tuning approach that improves robustness without sacrificing utility. The evaluation finds that the studied foundation models are generally non-robust to common perturbations.
📝 Abstract
A vision foundation model outputs an embedding vector for an image, and this embedding can be affected by common editing operations (e.g., JPEG compression, brightness adjustment, contrast adjustment). Such common perturbations alter embedding vectors and may impact the performance of downstream tasks that use these embeddings. In this work, we present the first systematic study of foundation models' robustness to such perturbations. We propose three robustness metrics, formulate five desired mathematical properties for these metrics, and analyze which properties each metric satisfies or violates. Using these metrics, we evaluate six industry-scale foundation models (from OpenAI and Meta) across nine common perturbation categories and find them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict the performance impact. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.
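To make the idea of embedding robustness concrete, here is a minimal sketch of one possible way to score it: the mean cosine similarity between the embeddings of clean and perturbed images. This is an illustrative assumption, not one of the three metrics proposed in the paper, which the abstract does not define. The names `toy_embed`, `adjust_brightness`, and `mean_cosine_robustness` are hypothetical, and the "model" is a fixed random linear projection standing in for a real foundation model.

```python
import numpy as np


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor` and clip back to [0, 1]."""
    return np.clip(image * factor, 0.0, 1.0)


def toy_embed(image, projection):
    """Hypothetical stand-in for a foundation model: a fixed linear projection."""
    return projection @ image.ravel()


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_cosine_robustness(images, perturb, embed):
    """Average cosine similarity between clean and perturbed embeddings.

    Illustrative score only (an assumption, not the paper's metric):
    values near 1.0 mean the perturbation barely moves the embeddings.
    """
    scores = [cosine_similarity(embed(x), embed(perturb(x))) for x in images]
    return float(np.mean(scores))


# Synthetic data: four random 8x8 RGB "images" with pixel values in [0, 1].
rng = np.random.default_rng(0)
projection = rng.standard_normal((16, 8 * 8 * 3))
images = rng.uniform(0.0, 1.0, size=(4, 8, 8, 3))


def embed(image):
    return toy_embed(image, projection)


# No perturbation: embeddings are unchanged, so the score is 1 up to rounding.
identity_score = mean_cosine_robustness(images, lambda x: x, embed)

# Brightness increase with clipping: embeddings shift, so the score drops.
brighter_score = mean_cosine_robustness(
    images, lambda x: adjust_brightness(x, 1.5), embed
)

print(f"identity:   {identity_score:.4f}")
print(f"brightness: {brighter_score:.4f}")
```

With a real model, `embed` would wrap the model's image encoder, and the same score could be computed per perturbation category to compare models. Whether a score like this satisfies the desired properties discussed in the paper is not something this sketch establishes.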