🤖 AI Summary
Existing Vision Transformers (ViTs), such as CLIP, employ fixed low-resolution inputs (typically 224×224), leading to significant loss of fine-grained spatial detail in feature maps and limiting performance on tasks requiring high structural fidelity. To address this, we propose a lightweight, training-free, semantically coherent feature-level upsampling method that integrates differentiable interpolation with local attention-guided feature super-resolution. The approach is architecture-agnostic and seamlessly embeddable into mainstream ViT-based pipelines—including semantic segmentation, object detection, and knowledge distillation frameworks like RADIO. Without retraining or increasing inference latency, it effectively recovers structural information suppressed by low-resolution encoding. Empirically, it consistently improves mAP and IoU across multiple downstream tasks. When adopted as the distillation target in RADIO, it enables student models to closely match the performance of high-resolution teacher models while incurring negligible additional computational overhead.
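The summary does not spell out the algorithm, but its description of "differentiable interpolation with local attention-guided feature super-resolution" can be sketched roughly as follows. This is an illustrative assumption of one plausible instantiation, not the paper's actual method: the function names, the 3×3 attention window, and the temperature `tau` are all hypothetical choices. Each low-res ViT feature vector is first bilinearly interpolated to the target grid, then each interpolated vector is refined by softmax attention over the nearby low-res tokens.

```python
import numpy as np

def bilinear_upsample(feat, scale):
    """Differentiable interpolation step: (H, W, C) -> (H*scale, W*scale, C)."""
    H, W, C = feat.shape
    ys = (np.arange(H * scale) + 0.5) / scale - 0.5
    xs = (np.arange(W * scale) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0, 1)[None, :, None]   # horizontal blend weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def attention_refine(up, feat, scale, window=3, tau=0.1):
    """Local attention step (hypothetical): each upsampled vector attends over
    the window x window neighborhood of low-res tokens around its source cell,
    so the output stays a convex combination of real encoder features."""
    H, W, C = feat.shape
    Ho, Wo, _ = up.shape
    out = np.empty_like(up)
    r = window // 2
    for i in range(Ho):
        for j in range(Wo):
            ci = min(i // scale, H - 1)
            cj = min(j // scale, W - 1)
            keys = feat[max(ci - r, 0):min(ci + r + 1, H),
                        max(cj - r, 0):min(cj + r + 1, W)].reshape(-1, C)
            q = up[i, j]                          # interpolated query vector
            logits = keys @ q / (tau * np.sqrt(C))
            w = np.exp(logits - logits.max())     # numerically stable softmax
            w /= w.sum()
            out[i, j] = w @ keys
    return out
```

Because both steps reuse frozen encoder outputs, a pipeline built this way would need no retraining, which matches the summary's "training-free, architecture-agnostic" claim; the actual method may differ in how queries, keys, and neighborhoods are defined.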
📝 Abstract
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is Vision Transformers (ViTs), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.