NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

📅 2025-11-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision foundation models (VFMs) produce spatially downsampled features, which hinders pixel-level tasks. Existing upsamplers trade generality against accuracy: classical filters are fast and apply to any model but rely on fixed forms, while learned upsamplers reach higher fidelity but must be retrained for each VFM. Neighborhood Attention Filtering (NAF) learns its filtering weights but is applied zero-shot: it upsamples features from any VFM without retraining. It combines Cross-Scale Neighborhood Attention with Rotary Position Embeddings (RoPE) to produce adaptive spatial-and-content weights, guided solely by the high-resolution input image. The authors report that NAF scales to 2K feature maps and reconstructs intermediate-resolution maps at 18 FPS, and that it is the first VFM-agnostic architecture to outperform VFM-specific upsamplers, with state-of-the-art results across multiple downstream tasks. It also performs strongly on image restoration. Code and checkpoints are publicly available.

📝 Abstract

Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
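
To make the idea of image-guided neighborhood filtering concrete, here is a minimal sketch in plain Python. It is not the NAF implementation: NAF learns its query/key encoders and uses RoPE, whereas this sketch substitutes a hand-set similarity (negative squared intensity difference with a hypothetical temperature `tau`). The names `avg_pool`, `neighborhood_upsample`, `radius`, and `tau` are illustrative assumptions. Each high-resolution pixel forms a query from the guidance image, compares it with keys pooled from the same image at feature resolution, and averages the low-resolution features in a small window with softmax weights.

```python
import math


def avg_pool(img, s):
    """Average-pool a 2D list of floats by an integer factor s."""
    h, w = len(img) // s, len(img[0]) // s
    return [
        [
            sum(img[i * s + a][j * s + b] for a in range(s) for b in range(s)) / (s * s)
            for j in range(w)
        ]
        for i in range(h)
    ]


def neighborhood_upsample(feats, guide, scale, radius=1, tau=0.1):
    """Upsample low-res features using a high-res grayscale guidance image.

    feats: h x w x C nested lists (low-resolution features).
    guide: (h*scale) x (w*scale) nested lists (high-resolution guidance).
    Similarity is hand-set here (an assumption); NAF learns it instead.
    """
    h, w, C = len(feats), len(feats[0]), len(feats[0][0])
    keys = avg_pool(guide, scale)  # guidance seen at feature resolution
    out = []
    for y in range(h * scale):
        row = []
        for x in range(w * scale):
            cy, cx = y // scale, x // scale  # low-res cell containing this pixel
            q = guide[y][x]                  # query from the high-res pixel
            logits, vals = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w:  # stay inside the map
                        logits.append(-((q - keys[ny][nx]) ** 2) / tau)
                        vals.append(feats[ny][nx])
            m = max(logits)  # subtract the max for a numerically stable softmax
            ws = [math.exp(l - m) for l in logits]
            z = sum(ws)
            row.append([sum(wi * v[c] for wi, v in zip(ws, vals)) / z for c in range(C)])
        out.append(row)
    return out


# Toy input: a vertical edge in the guidance image and matching low-res features.
guide = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
feats = [[[0.0], [1.0]], [[0.0], [1.0]]]
up = neighborhood_upsample(feats, guide, scale=2)
```

Because the weights depend only on the guidance image and on relative position in the window, nothing in the filter is tied to a particular feature extractor, which is the property that allows an upsampler of this kind to be reused across VFMs.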

Problem

Research questions and friction points this paper is trying to address.

Upsampling spatially downsampled representations from Vision Foundation Models

Bridging the trade-off between fast, broadly applicable classical filters and accurate but VFM-specific learned upsamplers

Achieving zero-shot VFM-agnostic upsampling without retraining requirements

Innovation

Methods, ideas, or system contributions that make the work stand out.

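Learns adaptive spatial-and-content weights via Cross-Scale Neighborhood Attention

Encodes position with Rotary Position Embeddings (RoPE), with guidance coming solely from the high-resolution input image

Upsamples features from any VFM zero-shot, without retraining the upsampler for each VFM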
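
As a brief illustration of why rotary embeddings suit a neighborhood filter, the sketch below shows the standard one-dimensional RoPE rotation and checks that the query-key dot product depends only on the offset between positions. It is a generic RoPE example, not NAF's code: the function names `rope` and `dot` and the `base` value are assumptions, and how NAF applies RoPE to two-dimensional pixel coordinates is not specified here.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of vec by position-dependent angles.

    Pair p (channels 2p, 2p+1) is rotated by pos * base**(-2p/d),
    which is the standard RoPE frequency schedule.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2p, so the exponent is -2p/d
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (2 positions apart) at two different absolute locations.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

The two scores agree up to floating-point error, so attention weights computed this way respond to where a neighbor sits relative to the query rather than to absolute image coordinates.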
