🤖 AI Summary
This work addresses degraded cross-category generalization in few-shot generalist anomaly detection, which the authors attribute to two causes: coarse-grained unified text prompts and fine-tuning on auxiliary data. They propose Res$^2$CLIP, a symmetric residual-to-residual alignment framework that operates in a CLIP-derived residual space and that they describe as the first of its kind. The framework jointly optimizes three branches (a text prompt-based branch, a visual prompt-based branch, and a residual-to-residual alignment branch) within a unified residual space. By modeling relative anomaly deviations rather than class-specific features, it is designed to remove feature differences among normal regions and to mitigate category bias. Experiments on multiple datasets are reported to demonstrate the effectiveness of the approach for detecting anomalies in unseen categories under few-shot settings.
📝 Abstract
Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, which is difficult in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major problems. First, coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch. Second, fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address both, we propose shifting multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained differences among normal features across regions as well as class-specific biases. Based on this insight, we design Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges the visual and text modalities within CLIP's residual space. From this residual perspective, the framework comprises three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimization is constrained to the residual domain, and the residual alignment objectives are designed to force the model to focus on relative anomaly deviations rather than on class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.
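The idea of aligning residuals with residuals can be illustrated with a small sketch. The abstract does not specify how the residuals are formed, so everything below is an assumption made for illustration, not the authors' implementation: the visual residual is taken as a query patch feature minus its most similar few-shot normal patch feature, the text residual as an "abnormal" prompt embedding minus a "normal" prompt embedding, and the anomaly score as the cosine similarity between the two. The function names are hypothetical, and hand-written vectors stand in for CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length; all-zero vectors stay (near) zero."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_residuals(query_patches, normal_patches):
    """ASSUMPTION: residual = query patch minus its nearest normal reference patch."""
    q = l2_normalize(query_patches)      # (num_query, dim)
    n = l2_normalize(normal_patches)     # (num_normal, dim)
    nearest = np.argmax(q @ n.T, axis=1)  # most similar normal patch per query patch
    return q - n[nearest]


def text_residual(abnormal_text, normal_text):
    """ASSUMPTION: residual = abnormal prompt embedding minus normal prompt embedding."""
    return l2_normalize(abnormal_text) - l2_normalize(normal_text)


def residual_alignment_scores(query_patches, normal_patches, abnormal_text, normal_text):
    """Cosine similarity between each visual residual and the text residual."""
    v_res = l2_normalize(visual_residuals(query_patches, normal_patches))
    t_res = l2_normalize(text_residual(abnormal_text, normal_text))
    return v_res @ t_res                 # (num_query,)


if __name__ == "__main__":
    # Toy stand-ins for CLIP features (hypothetical values, dim = 4).
    normal_patches = np.array([[1.0, 0.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0, 0.0]])
    normal_text = np.array([0.0, 0.0, 1.0, 0.0])
    abnormal_text = np.array([0.0, 0.0, 0.0, 1.0])
    # Two patches match the normal references; the third deviates from one.
    query_patches = np.array([[1.0, 0.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0, 0.0],
                              [1.0, 0.0, -1.0, 1.0]])
    print(residual_alignment_scores(query_patches, normal_patches,
                                    abnormal_text, normal_text))
```

In this toy setup, a query patch that matches a normal reference has a zero residual and therefore a zero score regardless of which object category it comes from, which is one way to read the claim that residuals remove class-specific bias. The learnable prompts and training objectives described in the abstract are not modeled here.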