🤖 AI Summary
Vision-language models (e.g., CLIP) often rely on spurious correlations—so-called "decision shortcuts"—in fine-grained image classification, leading to poor out-of-distribution generalization. To address this, the authors propose Spurious Feature Eraser (SEraser), an inference-time method that identifies and suppresses non-causal shortcut features by tuning a learnable prompt, thereby strengthening task-invariant causal feature representations. SEraser combines test-time prompt tuning with analysis of the model's feature space to erase spurious features. Crucially, it requires no model fine-tuning or architectural modification. Evaluated across multiple fine-grained classification benchmarks, SEraser reduces average classification error by 12.7% compared to strong baselines, significantly outperforming existing test-time adaptation approaches. It markedly improves model robustness and cross-domain generalization without additional training overhead.
📝 Abstract
Vision-language foundation models have achieved remarkable success across a multitude of downstream tasks owing to their scalability on extensive image-text paired data. However, these models also exhibit significant limitations on downstream tasks such as fine-grained image classification, as a result of "decision shortcuts" that hinder their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both *desired invariant causal features* and *undesired decision shortcuts*. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, we propose a simple yet effective method, Spurious Feature Eraser (SEraser), which alleviates decision shortcuts by erasing spurious features. Specifically, we introduce a test-time prompt tuning paradigm that optimizes a learnable prompt, thereby compelling the model to exploit invariant features while disregarding decision shortcuts during inference. The proposed method effectively alleviates excessive dependence on potentially misleading spurious information. Comparative analyses against a range of existing approaches validate its significant superiority.