🤖 AI Summary
Real-world visual perception demands invariance to geometric and photometric transformations such as rotation, illumination variation, and color shifts, but existing approaches rely on either architecture-specific designs or predefined data augmentations, limiting generalizability. To address this, we propose FOCAL, the first test-time framework that leverages internet-scale priors from foundation models (e.g., CLIP, SAM) to generate and optimize candidate transformations, mapping inputs to canonical "normalized" views without fine-tuning or architectural modification. FOCAL thus enables data-driven normalization grounded in semantic and geometric consistency, eliminating dependence on transformation-specific training data. It offers a scalable path to robustness and enables novel applications such as active vision. Experiments demonstrate substantial improvements in the robustness of CLIP and SAM under 2D/3D rotations, contrast variations, chromatic biases, and day-night domain shifts.
📝 Abstract
Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, "canonical" views, FOCAL enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance. Our code is available at: https://github.com/sutkarsh/focal.
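To make the test-time loop concrete, here is a minimal sketch of the idea described above. It is our own illustration, not the authors' implementation (see the linked repository for that): it enumerates candidate 2D rotations and scores each by CLIP image-text agreement with a neutral caption, which we assume as a stand-in for FOCAL's actual canonicality objective; the paper's candidate generator, prompts, and optimizer may differ.

```python
# Hedged sketch of FOCAL-style test-time normalization (assumptions noted below).
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed prior: in internet-scale data, upright, well-exposed photos dominate,
# so a generic caption should agree more strongly with a "canonical" view.
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a photo"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def canonicality(img: Image.Image) -> float:
    """Score a candidate view via CLIP image-text similarity
    (an assumed proxy for FOCAL's canonicality objective)."""
    x = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        f = model.encode_image(x)
        f = f / f.norm(dim=-1, keepdim=True)
    return (f @ text_feat.T).item()

def normalize_view(img: Image.Image, angles=range(0, 360, 15)) -> Image.Image:
    """Generate candidate 2D rotations and keep the best-scoring ("normalized")
    view; the result is then fed to the downstream model (CLIP, SAM, ...)."""
    return max((img.rotate(a, expand=True) for a in angles), key=canonicality)
```

A continuous variant could parameterize the transformation and ascend the same score by gradient; the discrete search here is simply the easiest instance of generate-and-optimize, and the same recipe extends to contrast, color, and 3D viewpoint candidates.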