🤖 AI Summary
Existing learning-based feature upsamplers must be retrained for each vision encoder (e.g., DINO, CLIP) and therefore generalize poorly across encoders. This paper proposes AnyUp, a plug-and-play, feature-agnostic upsampling framework that operates at inference time on any vision feature. Its core idea is to decouple the feature representation from the upsampling process: a lightweight network, combined with channel-wise normalization and multi-scale reconstruction, enables high-fidelity upsampling of features at arbitrary resolutions across heterogeneous encoder architectures, without relying on encoder-specific priors. Experiments show that AnyUp sets a new state of the art on downstream tasks such as semantic segmentation and object detection, improving semantic fidelity and computational efficiency while generalizing across encoders and remaining easy to deploy.
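To make the "feature-agnostic" idea concrete, here is a minimal numpy sketch of one way an upsampler can handle features from any encoder regardless of channel count: normalize each channel to zero mean and unit variance, apply a channel-count-agnostic spatial upsampling (plain bilinear here, standing in for AnyUp's learned module), then restore the original channel statistics. This is an illustrative assumption-laden sketch, not the authors' actual architecture; the function names (`channelwise_normalize`, `upsample_any`) are hypothetical.

```python
import numpy as np

def channelwise_normalize(feat):
    """Normalize each channel of a (C, H, W) feature map to zero mean, unit variance."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True) + 1e-6
    return (feat - mu) / sigma, mu, sigma

def bilinear_upsample(feat, scale):
    """Naive bilinear upsampling; works for any channel count C."""
    C, H, W = feat.shape
    Ho, Wo = H * scale, W * scale
    # Sample positions in the input grid (half-pixel centers)
    ys = (np.arange(Ho) + 0.5) / scale - 0.5
    xs = (np.arange(Wo) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    wy = np.clip(ys - y0, 0, 1)[None, :, None]
    wx = np.clip(xs - x0, 0, 1)[None, None, :]
    f00 = feat[:, y0][:, :, x0]
    f01 = feat[:, y0][:, :, x1]
    f10 = feat[:, y1][:, :, x0]
    f11 = feat[:, y1][:, :, x1]
    top = f00 * (1 - wx) + f01 * wx
    bot = f10 * (1 - wx) + f11 * wx
    return top * (1 - wy) + bot * wy

def upsample_any(feat, scale=2):
    """Upsample a feature map of any dimensionality (e.g. DINO's 768 or CLIP's 512 channels)."""
    norm, mu, sigma = channelwise_normalize(feat)
    up = bilinear_upsample(norm, scale)
    return up * sigma + mu  # restore per-channel statistics

# Same code path regardless of encoder channel count:
dino_like = np.random.default_rng(0).standard_normal((768, 16, 16))
clip_like = np.random.default_rng(1).standard_normal((512, 16, 16))
print(upsample_any(dino_like).shape)  # (768, 32, 32)
print(upsample_any(clip_like).shape)  # (512, 32, 32)
```

Because the normalization and interpolation treat channels independently, no part of the pipeline is tied to a specific encoder's feature dimensionality, which is the property that lets a single upsampler serve heterogeneous backbones.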
📝 Abstract
We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.