Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision foundation models (e.g., ViT) employ high feature downsampling ratios (14×/16×), rendering their features unsuitable for pixel-level tasks without explicit upscaling. Existing upsampling methods rely on retraining or implicit optimization, limiting generalizability and scalability across architectures and modalities. To address this, we propose a **training-free, test-time adaptive** lightweight upsampling framework. Our method introduces an **anisotropic Gaussian kernel**, jointly modeling spatial and range cues to enable edge-aware, architecture- and modality-agnostic feature reconstruction. By integrating Gaussian lattices with joint bilinear interpolation principles, it dynamically learns image-specific adaptive upsampling kernels at inference time. Evaluated on semantic segmentation, depth estimation, and probabilistic map upsampling, our approach achieves state-of-the-art performance. On a 224×224 input, single-image inference takes only 0.419 seconds.

📝 Abstract
We present **Upsample Anything**, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14×/16× (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ≈0.419 s per 224×224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
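To make the bridge to Joint Bilateral Upsampling concrete, below is a minimal NumPy sketch of classical JBU, the operator the paper generalizes: each high-resolution output pixel is a weighted average of nearby low-resolution features, with a spatial Gaussian term and a range term computed from a high-resolution guide image. This is not the authors' code; in the paper the Gaussian parameters are anisotropic and learned per image at test time, whereas here `sigma_s`, `sigma_r`, and `radius` are fixed, hand-picked values for illustration.

```python
import numpy as np

def joint_bilateral_upsample(lowres, guide, sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample `lowres` (h, w, c) to the resolution of `guide` (H, W).

    Illustrative JBU with fixed isotropic Gaussians; the paper instead
    learns anisotropic, image-specific kernel parameters at inference time.
    """
    H, W = guide.shape[:2]
    h, w = lowres.shape[:2]
    sy, sx = h / H, w / W  # map high-res coordinates onto the low-res lattice
    out = np.zeros((H, W, lowres.shape[2]))
    for y in range(H):
        for x in range(W):
            cy, cx = y * sy, x * sx  # continuous position in low-res grid
            acc = np.zeros(lowres.shape[2])
            wsum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly = int(round(cy)) + dy
                    lx = int(round(cx)) + dx
                    if not (0 <= ly < h and 0 <= lx < w):
                        continue
                    # Spatial term: Gaussian over distance in the low-res lattice.
                    ws = np.exp(-((ly - cy) ** 2 + (lx - cx) ** 2) / (2 * sigma_s ** 2))
                    # Range term: similarity between the guide pixel at the output
                    # location and the guide pixel nearest the low-res sample.
                    gy = min(int(ly / sy), H - 1)
                    gx = min(int(lx / sx), W - 1)
                    diff = guide[y, x] - guide[gy, gx]
                    wr = np.exp(-np.sum(diff ** 2) / (2 * sigma_r ** 2))
                    acc += ws * wr * lowres[ly, lx]
                    wsum += ws * wr
            out[y, x] = acc / max(wsum, 1e-8)
    return out
```

Because the weights are normalized, the operator preserves constant inputs exactly, and the range term suppresses averaging across guide-image edges, which is what makes the upsampling edge-aware.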
Problem

Research questions and friction points this paper is trying to address.

Restoring low-resolution features to high-resolution pixel outputs without training
Addressing feature downsampling limitations in Vision Foundation Models for pixel-level applications
Providing universal edge-aware upsampling that transfers across architectures and modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time optimization without training
Learns anisotropic Gaussian kernel
Universal edge-aware upsampling operator