Spotlighter: Revisiting Prompt Tuning from a Representative Mining View

📅 2025-08-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Redundant or weakly correlated visual tokens degrade both the accuracy and efficiency of cross-modal semantic alignment in prompt tuning. To address this, we propose Spotlighter, a lightweight framework that first constructs class-specific prototype memory banks and then adaptively selects the most representative visual tokens via a dual-perspective token activation assessment (evaluating both sample-level and semantic-level relevance) and a dynamically weighted two-tier ranking strategy. Built on the CLIP architecture, our method integrates visual token selection, prototype matching, and token-prototype interaction scoring. Evaluated on 11 few-shot benchmarks, Spotlighter achieves up to an 11.19% improvement in harmonic mean accuracy, accelerates inference by up to 0.8K FPS, and introduces only 21 additional trainable parameters, substantially outperforming standard CLIP and state-of-the-art prompt tuning approaches.
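
A minimal sketch of the dual-perspective token scoring described above, assuming the sample-level view is cosine similarity to the image's own global feature and the semantic-level view is similarity to the class-prompt embeddings. The function name, the fusion weight `alpha`, and the budget `k` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_tokens(visual_tokens, cls_token, text_features, alpha=0.5, k=32):
    """Keep the k visual tokens with the highest combined activation (sketch).

    visual_tokens: (N, D) patch-token embeddings from the CLIP image encoder
    cls_token:     (D,)   global image embedding of the same sample
    text_features: (C, D) class-prompt embeddings from the CLIP text encoder
    """
    v = F.normalize(visual_tokens, dim=-1)
    # Sample-level relevance: similarity of each token to the image's own global feature.
    sample_score = v @ F.normalize(cls_token, dim=-1)                               # (N,)
    # Semantic-level relevance: best similarity of each token to any class prompt.
    semantic_score = (v @ F.normalize(text_features, dim=-1).T).max(dim=-1).values  # (N,)
    # Fuse the two views and keep the top-k tokens (alpha is an assumed fixed weight).
    score = alpha * sample_score + (1 - alpha) * semantic_score
    keep = score.topk(k).indices
    return visual_tokens[keep], score

# Toy usage with random features (ViT-B/16-like shapes).
tokens, cls, text = torch.randn(196, 512), torch.randn(512), torch.randn(100, 512)
selected, scores = select_tokens(tokens, cls, text)
print(selected.shape)  # torch.Size([32, 512])
```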

📝 Abstract
CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token-prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
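
The class-specific semantic memory bank could be maintained roughly as sketched below. This sketch assumes prototypes are refreshed by an exponential moving average of their nearest selected tokens; the class name, `momentum` value, and update rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Hypothetical class-specific prototype memory bank (sketch)."""

    def __init__(self, num_classes, num_prototypes, dim, momentum=0.9):
        self.protos = F.normalize(torch.randn(num_classes, num_prototypes, dim), dim=-1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, class_id, tokens):
        """Refresh one class's prototypes with newly selected tokens (EMA assumption)."""
        t = F.normalize(tokens, dim=-1)                 # (N, D)
        sim = t @ self.protos[class_id].T               # (N, P)
        assign = sim.argmax(dim=-1)                     # nearest prototype per token
        for p in assign.unique():
            mean_feat = t[assign == p].mean(dim=0)
            self.protos[class_id, p] = F.normalize(
                self.momentum * self.protos[class_id, p] + (1 - self.momentum) * mean_feat,
                dim=-1,
            )

    def match(self, tokens):
        """Token-prototype similarity against every class's prototypes: (N, C, P)."""
        t = F.normalize(tokens, dim=-1)
        return torch.einsum("nd,cpd->ncp", t, self.protos)

bank = PrototypeBank(num_classes=100, num_prototypes=4, dim=512)
bank.update(class_id=3, tokens=torch.randn(32, 512))
print(bank.match(torch.randn(32, 512)).shape)  # torch.Size([32, 100, 4])
```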
Problem

Research questions and friction points this paper is trying to address.

Reduces noise from redundant visual tokens
Enhances accuracy and efficiency in prompt tuning
Selects top-scoring tokens for cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight token-selection framework enhances accuracy
Semantic memory bank refines token selection representativeness
Two-level ranking mechanism weights token-prototype interactions dynamically (sketched below)
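
A hedged sketch of how such a two-level, dynamically weighted ranking over token-prototype interactions might produce class scores: the first level keeps each token's best-matching prototype per class, the second level ranks tokens within the image, and a learnable scalar fuses the two. All names and the fusion rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ranked_class_score(sim, w=torch.tensor(0.5)):
    """sim: (N, C, P) token-prototype similarities; returns per-class scores (C,)."""
    # Level 1: for each token and class, keep its best-matching prototype.
    proto_level = sim.max(dim=-1).values                 # (N, C)
    # Level 2: weight tokens by how strongly they rank within the image, per class.
    token_weight = F.softmax(proto_level, dim=0)         # (N, C)
    weighted = (token_weight * proto_level).sum(dim=0)   # (C,)
    # Dynamic fusion with the unweighted average, controlled by a learnable scalar w.
    plain = proto_level.mean(dim=0)                      # (C,)
    return torch.sigmoid(w) * weighted + (1 - torch.sigmoid(w)) * plain

scores = ranked_class_score(torch.randn(32, 100, 4))
print(scores.shape)  # torch.Size([100])
```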
Yutong Gao
Nanjing University of Science and Technology
computer vision, NLP, AIGC
Maoyuan Shao
School of Information Engineering, Minzu University of China
Xinyang Huang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Chuang Zhu
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Lijuan Sun
Johns Hopkins University
Cancer, Immunotherapy, molecular biology
Yu Weng
School of Information Engineering, Minzu University of China
Xuan Liu
School of Information Engineering, Minzu University of China
Guoshun Nan
Professor, Beijing University of Posts and Telecommunications
Multimodal Learning, Video LLM, 6G Security, Semantic Communications