🤖 AI Summary
Vision foundation models (VFMs) produce spatially downsampled features, which hinders pixel-level tasks. Existing upsampling methods trade accuracy against generality: classical filters are fast and broadly applicable but limited to fixed forms, while learnable upsamplers achieve higher fidelity yet must be retrained for each VFM. To address this, the authors propose Neighborhood Attention Filtering (NAF), a zero-shot feature upsampler that is trained once and then applied to features from any VFM without retraining. NAF uses Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE) to learn adaptive spatial-and-content weights guided solely by the high-resolution input image. The method scales to 2K feature maps and reconstructs intermediate-resolution maps at 18 FPS. Evaluations on downstream tasks such as semantic segmentation and depth estimation show it outperforming both general-purpose and VFM-specific upsamplers, making it the first VFM-agnostic architecture to do so, and it also performs strongly on image restoration. Code and checkpoints are publicly available.
📝 Abstract
Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
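The core idea in the abstract, upsampling weights computed from the high-resolution image alone and then applied to low-resolution features, can be illustrated with a minimal Python sketch. This is not the NAF implementation. The function names `downsample` and `neighborhood_upsample` and the parameters `radius` and `temperature` are hypothetical, raw grayscale intensities stand in for the learned query/key encodings, a negative squared difference replaces the query-key dot product, and RoPE is omitted entirely.

```python
import math


def downsample(img, s):
    """Average-pool a 2D grid (list of rows) by an integer factor s."""
    h, w = len(img) // s, len(img[0]) // s
    return [
        [
            sum(img[i * s + di][j * s + dj] for di in range(s) for dj in range(s)) / (s * s)
            for j in range(w)
        ]
        for i in range(h)
    ]


def neighborhood_upsample(feats, guide, scale, radius=1, temperature=0.1):
    """Upsample low-res feats (h x w x C) to the guide's resolution.

    Assumption: queries are raw high-res guide pixels and keys are the
    average-pooled guide values; the real method uses learned encodings.
    The weights never look at feats, so the same weights would apply to
    features from any backbone.
    """
    h, w = len(feats), len(feats[0])
    channels = len(feats[0][0])
    keys = downsample(guide, scale)
    out = []
    for y in range(h * scale):
        row = []
        for x in range(w * scale):
            cy, cx = y // scale, x // scale  # low-res cell under this pixel
            q = guide[y][x]
            logits, values = [], []
            # Attend only to a (2*radius+1)^2 low-res window, clipped at borders.
            for ny in range(max(0, cy - radius), min(h, cy + radius + 1)):
                for nx in range(max(0, cx - radius), min(w, cx + radius + 1)):
                    # Stand-in similarity: closer intensity gives a larger logit.
                    logits.append(-((q - keys[ny][nx]) ** 2) / temperature)
                    values.append(feats[ny][nx])
            # Numerically stable softmax over the neighborhood.
            m = max(logits)
            weights = [math.exp(z - m) for z in logits]
            total = sum(weights)
            row.append(
                [sum(wt * v[c] for wt, v in zip(weights, values)) / total for c in range(channels)]
            )
        out.append(row)
    return out


if __name__ == "__main__":
    # A 4x4 guide with a vertical edge, and 2x2 single-channel features.
    guide = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    feats = [[[0.0], [10.0]], [[0.0], [10.0]]]
    for row in neighborhood_upsample(feats, guide, scale=2):
        print([round(p[0], 3) for p in row])
```

Because each output is a softmax-weighted average of nearby low-resolution features, the result stays within the range of its inputs, and the upsampled features follow edges in the guidance image rather than the coarse feature grid.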