🤖 AI Summary
This work addresses a key limitation of existing vision foundation models (VFMs): inference typically relies on a single-scale input, overlooking the complementary perceptual information embedded in multi-resolution imagery. To overcome this constraint, we propose MuRF, a training-free, architecture-agnostic multi-resolution fusion strategy that operates at inference time by extracting and integrating multi-scale features from frozen VFMs such as DINOv2 and SigLIP2 into a unified representation. Serving as a general-purpose, plug-and-play module, MuRF consistently enhances the performance of diverse VFMs across a range of critical vision tasks, transcending the inherent limitations of single-scale inference.
📝 Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, while also demonstrating successful generalization to contrastive models such as SigLIP2.
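The core idea described above (running a frozen encoder at several input resolutions and fusing the resulting feature maps on a common grid) can be sketched as follows. This is a minimal, hedged reading of the fusion step, not the paper's actual implementation: `dummy_vfm` is a hypothetical stand-in for a frozen VFM like DINOv2 (which produces one feature vector per image patch), nearest-neighbor resizing replaces whatever interpolation the authors use, and simple averaging stands in for the fusion operator.

```python
import numpy as np

def dummy_vfm(image, patch=14):
    """Hypothetical stand-in for a frozen VFM (e.g. DINOv2, patch size 14):
    returns one feature per patch by average-pooling pixels, purely to make
    the sketch self-contained. A real VFM would run a ViT forward pass."""
    H, W, C = image.shape
    h, w = H // patch, W // patch
    cropped = image[:h * patch, :w * patch]
    return cropped.reshape(h, patch, w, patch, C).mean(axis=(1, 3))  # (h, w, C)

def resize_nn(arr, size):
    """Nearest-neighbor square resize; keeps the sketch dependency-free
    (a real pipeline would likely use bilinear interpolation)."""
    H, W = arr.shape[:2]
    rows = np.arange(size) * H // size
    cols = np.arange(size) * W // size
    return arr[rows][:, cols]

def murf_features(image, scales=(224, 448), encoder=dummy_vfm):
    """Multi-resolution fusion at inference time: encode the image at each
    scale with the frozen encoder, align all feature maps to the finest
    grid, and average them into one unified representation."""
    maps = [encoder(resize_nn(image, s)) for s in scales]
    target = max(m.shape[0] for m in maps)          # finest patch grid
    aligned = [resize_nn(m, target) for m in maps]  # upsample coarse maps
    return np.mean(aligned, axis=0)                 # fused features
```

With a 448x448 input and scales (224, 448), the two views yield 16x16 and 32x32 patch grids; the coarse map is upsampled to 32x32 before averaging, so the fused output preserves the high-resolution spatial detail while incorporating the low-resolution view's globally pooled context.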