🤖 AI Summary
This work addresses the limited zero-shot classification robustness of vision-language foundation models on low-resolution or pixelated images. To this end, it introduces LR0.FM, the first dedicated benchmark for resolution robustness, systematically evaluating 10 state-of-the-art foundation models across 66 backbone architectures and 15 datasets. A new metric, Weighted Aggregated Robustness, addresses the limitations of existing metrics when comparing performance across resolutions and datasets. Three key findings emerge: (i) model size correlates positively with robustness to resolution degradation; (ii) the quality of the pre-training data matters more than its scale; and (iii) fine-tuned and higher-resolution models are less robust to low-resolution inputs. Further analysis shows that predictions remain semantically plausible under severe downsampling, and that the loss of fine-grained detail affects a model's early layers more than its deeper ones. Building on these insights, the authors propose LR-TK0, a lightweight enhancement strategy that consistently improves robustness across diverse models and datasets without altering the original pre-trained weights.
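The low-resolution setting studied here can be simulated by pixelating a full-resolution image before it reaches the model. The sketch below is illustrative only (the `pixelate` helper and the block-averaging scheme are assumptions, not the benchmark's actual degradation pipeline): it average-pools the image by a factor, then upsamples back to the original size so a fixed-input-size model can still consume it.

```python
import numpy as np

def pixelate(image: np.ndarray, factor: int) -> np.ndarray:
    """Simulate a low-resolution input: block-average by `factor`,
    then nearest-neighbour upsample back to the original size.
    `image` is H x W x C with H and W divisible by `factor`."""
    h, w, c = image.shape
    # Downsample: average each factor x factor block.
    low = image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    # Upsample: repeat each low-res pixel so the spatial size is restored.
    return np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)

img = np.arange(16 * 16 * 3, dtype=np.float64).reshape(16, 16, 3)
lr = pixelate(img, 4)
print(lr.shape)  # (16, 16, 3): same size, but only 4x4 distinct blocks remain
```

Comparing a model's zero-shot accuracy on `img` versus `lr` across a range of factors gives the kind of resolution-degradation curve the benchmark aggregates over.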
📝 Abstract
Vision-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness to low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality matters more than its size, and (iii) fine-tuned and higher-resolution models are less robust to LR inputs. Our analysis further reveals that models make semantically reasonable predictions at LR, and that the lack of fine-grained detail in the input affects a model's initial layers more than its deeper layers. We use these insights to introduce a simple strategy, LR-TK0, that enhances model robustness without compromising the pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low resolution across several datasets, as well as its generalization across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM
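The abstract does not spell out the Weighted Aggregated Robustness formula, but the general shape of such a metric can be sketched. In the hypothetical version below (the weighting scheme is an assumption, not the paper's definition), per-dataset robustness is the ratio of low-resolution to full-resolution accuracy, and datasets are weighted by the model's full-resolution accuracy so that datasets the model fails on even at full resolution contribute less to the aggregate.

```python
def weighted_aggregated_robustness(acc_hr: dict[str, float],
                                   acc_lr: dict[str, float]) -> float:
    """Hypothetical sketch of a weighted robustness aggregate.

    acc_hr: per-dataset accuracy at full (high) resolution.
    acc_lr: per-dataset accuracy at a fixed low resolution.
    Each dataset's relative robustness (LR / HR accuracy) is
    weighted by its HR accuracy, then averaged.
    """
    total_weight = sum(acc_hr.values())
    weighted = sum(acc_hr[d] * (acc_lr[d] / acc_hr[d]) for d in acc_hr)
    return weighted / total_weight

# Toy example: accuracy drops more on the fine-grained dataset.
hr = {"imagenet": 0.75, "cars": 0.60}
lr = {"imagenet": 0.60, "cars": 0.30}
print(round(weighted_aggregated_robustness(hr, lr), 3))  # 0.667
```

A score of 1.0 would mean no degradation at low resolution on any dataset; the weighting keeps a single near-zero-accuracy dataset from dominating the aggregate through an unstable accuracy ratio.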