🤖 AI Summary
Redundant or weakly correlated visual tokens degrade both the accuracy and efficiency of cross-modal semantic alignment in prompt tuning. To address this, we propose Spotlighter, a lightweight framework that first constructs class-specific prototype memory banks, then adaptively selects the most representative visual tokens via a dual-perspective token activation assessment (evaluating both sample-level and semantic-level relevance) and a dynamically weighted two-tier ranking strategy. Built upon the CLIP architecture, the method integrates visual token selection, prototype matching, and token-prototype interaction scoring. Evaluated on 11 few-shot benchmarks, Spotlighter achieves up to an 11.19% improvement in harmonic mean accuracy, accelerates inference by up to 0.8K FPS, and introduces only 21 additional trainable parameters, substantially outperforming standard CLIP and state-of-the-art prompt tuning approaches.
📝 Abstract
CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token-prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
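The dual-perspective token selection described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the scoring functions, the fixed mixing weight `alpha` (which Spotlighter instead weights dynamically via its two-level ranking), and the `keep_ratio` parameter are all assumptions for exposition.

```python
import numpy as np

def select_tokens(tokens, image_feat, prototypes, keep_ratio=0.5, alpha=0.5):
    """Score visual tokens from two perspectives and keep the top-scoring ones.

    tokens:     (N, D) patch-token features from the vision encoder.
    image_feat: (D,) global image feature (sample-wise reference).
    prototypes: (C, D) class-specific prototype memory bank (semantic reference).
    alpha:      illustrative fixed mixing weight between the two scores.
    Returns the sorted indices of the retained tokens.
    """
    # L2-normalize so dot products become cosine similarities.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    g = image_feat / np.linalg.norm(image_feat)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    sample_score = t @ g                     # relevance to this particular image
    semantic_score = (t @ p.T).max(axis=1)   # best match to any class prototype

    score = alpha * sample_score + (1 - alpha) * semantic_score
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(score)[::-1][:k]       # indices of the top-k tokens
    return np.sort(keep)
```

In practice the retained tokens would be passed to the downstream prediction head, while the prototype bank compensates for the information carried by the discarded tokens.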