🤖 AI Summary
Existing Vision Transformers (ViTs), such as CLIP, employ fixed low-resolution inputs (typically 224×224), leading to significant loss of fine-grained spatial detail in feature maps and limiting performance on tasks requiring high structural fidelity. To address this, we propose a lightweight, training-free, semantically coherent feature-level upsampling method that integrates differentiable interpolation with local attention-guided feature super-resolution. The approach is architecture-agnostic and seamlessly embeddable into mainstream ViT-based pipelines—including semantic segmentation, object detection, and knowledge distillation frameworks like RADIO. Without retraining or increasing inference latency, it effectively recovers structural information suppressed by low-resolution encoding. Empirically, it consistently improves mAP and IoU across multiple downstream tasks. When adopted as the distillation target in RADIO, it enables student models to closely match the performance of high-resolution teacher models while incurring negligible additional computational overhead.
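The summary does not spell out the algorithm, but its description of "differentiable interpolation with local attention-guided feature super-resolution" can be sketched roughly as follows. This is an illustrative assumption of one plausible instantiation, not the paper's actual method: the function names, the 3×3 attention window, and the temperature `tau` are all hypothetical choices. Each low-res ViT feature vector is first bilinearly interpolated to the target grid, then each interpolated vector is refined by softmax attention over the nearby low-res tokens.

```python
import numpy as np

def bilinear_upsample(feat, scale):
    """Differentiable interpolation step: (H, W, C) -> (H*scale, W*scale, C)."""
    H, W, C = feat.shape
    ys = (np.arange(H * scale) + 0.5) / scale - 0.5
    xs = (np.arange(W * scale) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0, 1)[None, :, None]   # horizontal blend weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def attention_refine(up, feat, scale, window=3, tau=0.1):
    """Local attention step (hypothetical): each upsampled vector attends over
    the window x window neighborhood of low-res tokens around its source cell,
    so the output stays a convex combination of real encoder features."""
    H, W, C = feat.shape
    Ho, Wo, _ = up.shape
    out = np.empty_like(up)
    r = window // 2
    for i in range(Ho):
        for j in range(Wo):
            ci = min(i // scale, H - 1)
            cj = min(j // scale, W - 1)
            keys = feat[max(ci - r, 0):min(ci + r + 1, H),
                        max(cj - r, 0):min(cj + r + 1, W)].reshape(-1, C)
            q = up[i, j]                          # interpolated query vector
            logits = keys @ q / (tau * np.sqrt(C))
            w = np.exp(logits - logits.max())     # numerically stable softmax
            w /= w.sum()
            out[i, j] = w @ keys
    return out
```

Because both steps reuse frozen encoder outputs, a pipeline built this way would need no retraining, which matches the summary's "training-free, architecture-agnostic" claim; the actual method may differ in how queries, keys, and neighborhoods are defined.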
📝 Abstract
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is Vision Transformers (ViTs), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.