Spotlighter: Revisiting Prompt Tuning from a Representative Mining View

📅 2025-08-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Redundant or weakly correlated visual tokens degrade both the accuracy and efficiency of cross-modal semantic alignment in prompt tuning. To address this, we propose Spotlighter, a lightweight framework that first constructs class-specific prototype memory banks and then adaptively selects the most representative visual tokens via a dual-perspective token activation assessment (evaluating both sample-level and semantic-level relevance) and a dynamically weighted two-tier ranking strategy. Built on the CLIP architecture, our method integrates visual token selection, prototype matching, and token-prototype interaction scoring. Evaluated on 11 few-shot benchmarks, Spotlighter achieves up to an 11.19% improvement in harmonic mean accuracy, accelerates inference by up to 0.8K FPS, and introduces only 21 additional trainable parameters, substantially outperforming standard CLIP and state-of-the-art prompt tuning approaches.
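
A minimal sketch of the dual-perspective token scoring described above, assuming the sample-level view is cosine similarity to the image's own global feature and the semantic-level view is similarity to the class-prompt embeddings. The function name, the fusion weight `alpha`, and the budget `k` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_tokens(visual_tokens, cls_token, text_features, alpha=0.5, k=32):
    """Keep the k visual tokens with the highest combined activation (sketch).

    visual_tokens: (N, D) patch-token embeddings from the CLIP image encoder
    cls_token:     (D,)   global image embedding of the same sample
    text_features: (C, D) class-prompt embeddings from the CLIP text encoder
    """
    v = F.normalize(visual_tokens, dim=-1)
    # Sample-level relevance: similarity of each token to the image's own global feature.
    sample_score = v @ F.normalize(cls_token, dim=-1)                               # (N,)
    # Semantic-level relevance: best similarity of each token to any class prompt.
    semantic_score = (v @ F.normalize(text_features, dim=-1).T).max(dim=-1).values  # (N,)
    # Fuse the two views and keep the top-k tokens (alpha is an assumed fixed weight).
    score = alpha * sample_score + (1 - alpha) * semantic_score
    keep = score.topk(k).indices
    return visual_tokens[keep], score

# Toy usage with random features (ViT-B/16-like shapes).
tokens, cls, text = torch.randn(196, 512), torch.randn(512), torch.randn(100, 512)
selected, scores = select_tokens(tokens, cls, text)
print(selected.shape)  # torch.Size([32, 512])
```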

📝 Abstract
CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token-prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
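
The class-specific semantic memory bank could be maintained roughly as sketched below. This sketch assumes prototypes are refreshed by an exponential moving average of their nearest selected tokens; the class name, `momentum` value, and update rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Hypothetical class-specific prototype memory bank (sketch)."""

    def __init__(self, num_classes, num_prototypes, dim, momentum=0.9):
        self.protos = F.normalize(torch.randn(num_classes, num_prototypes, dim), dim=-1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, class_id, tokens):
        """Refresh one class's prototypes with newly selected tokens (EMA assumption)."""
        t = F.normalize(tokens, dim=-1)                 # (N, D)
        sim = t @ self.protos[class_id].T               # (N, P)
        assign = sim.argmax(dim=-1)                     # nearest prototype per token
        for p in assign.unique():
            mean_feat = t[assign == p].mean(dim=0)
            self.protos[class_id, p] = F.normalize(
                self.momentum * self.protos[class_id, p] + (1 - self.momentum) * mean_feat,
                dim=-1,
            )

    def match(self, tokens):
        """Token-prototype similarity against every class's prototypes: (N, C, P)."""
        t = F.normalize(tokens, dim=-1)
        return torch.einsum("nd,cpd->ncp", t, self.protos)

bank = PrototypeBank(num_classes=100, num_prototypes=4, dim=512)
bank.update(class_id=3, tokens=torch.randn(32, 512))
print(bank.match(torch.randn(32, 512)).shape)  # torch.Size([32, 100, 4])
```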
Problem

Research questions and friction points this paper is trying to address.

Reduces noise from redundant visual tokens
Enhances accuracy and efficiency in prompt tuning
Selects top-scoring tokens for cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight token-selection framework enhances accuracy
Semantic memory bank refines token selection representativeness
Two-level ranking mechanism weights token-prototype interactions dynamically (sketched below)
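
A hedged sketch of how such a two-level, dynamically weighted ranking over token-prototype interactions might produce class scores: the first level keeps each token's best-matching prototype per class, the second level ranks tokens within the image, and a learnable scalar fuses the two. All names and the fusion rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ranked_class_score(sim, w=torch.tensor(0.5)):
    """sim: (N, C, P) token-prototype similarities; returns per-class scores (C,)."""
    # Level 1: for each token and class, keep its best-matching prototype.
    proto_level = sim.max(dim=-1).values                 # (N, C)
    # Level 2: weight tokens by how strongly they rank within the image, per class.
    token_weight = F.softmax(proto_level, dim=0)         # (N, C)
    weighted = (token_weight * proto_level).sum(dim=0)   # (C,)
    # Dynamic fusion with the unweighted average, controlled by a learnable scalar w.
    plain = proto_level.mean(dim=0)                      # (C,)
    return torch.sigmoid(w) * weighted + (1 - torch.sigmoid(w)) * plain

scores = ranked_class_score(torch.randn(32, 100, 4))
print(scores.shape)  # torch.Size([100])
```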
Yutong Gao
Nanjing University of Science and Technology
computer vision, NLP, AIGC
Maoyuan Shao
School of Information Engineering, Minzu University of China
Xinyang Huang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Chuang Zhu
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Lijuan Sun
Johns Hopkins University
Cancer, Immunotherapy, molecular biology
Yu Weng
School of Information Engineering, Minzu University of China
Xuan Liu
School of Information Engineering, Minzu University of China
Guoshun Nan
Professor, Beijing University of Posts and Telecommunications
Multimodal Learning, Video LLM, 6G Security, Semantic Communications