🤖 AI Summary
Existing visual prompt tuning methods rely on continuous, dense prompts, which tend to overfit task-irrelevant details and are sensitive to input noise, making it difficult to balance accuracy and robustness. This work proposes Spike-NVPT, which, to the authors' knowledge, is the first approach to incorporate biologically inspired spiking neuron mechanisms into visual prompt learning. Using integrate-and-fire (IF) units, the method accumulates task-relevant signals over time while filtering out transient noise, then discretizes the filtered output into sparse binary prompts. Spike-NVPT adds no computational overhead at inference and achieves competitive accuracy on clean data while improving noise robustness by up to 11.2%.
📝 Abstract
Pre-trained vision models are widely applied across diverse domains, and prompt tuning has emerged as a parameter-efficient paradigm for adapting them. While effective on standard benchmarks, learned prompts are continuous and dense, and such high-capacity prompts tend to overfit task-irrelevant details, which makes them sensitive to input noise. To address this trade-off between accuracy and robustness, we propose Spike-NVPT, a noise-robust visual prompt tuning method. Specifically, we design a Signal Filtering Layer based on spiking neurons, which uses the integrate-and-fire (IF) mechanism to accumulate task-relevant signals over time and filter out transient noise fluctuations. A subsequent Spike Discretization Unit converts the filtered signals into sparse binary prompts. This discretization acts as a strong regularizer, forcing the model to anchor its decision boundaries on the most discriminative and robust features. Notably, the resulting binary prompts remain static during deployment, so they add no computational overhead at inference. Experimental results demonstrate that Spike-NVPT improves robustness by up to 11.2% over conventional methods while retaining competitive accuracy on clean datasets. To the best of our knowledge, this is the first attempt to leverage spiking neurons for fine-tuning traditional artificial neural network (ANN)-based visual models.
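The filtering behaviour of an integrate-and-fire unit can be illustrated with a minimal NumPy sketch. This is not the paper's Signal Filtering Layer or Spike Discretization Unit, whose details the abstract does not give. The function names, the threshold of 1.0, the hard reset, and the "fired at least once" rule for forming a binary prompt are all assumptions chosen for illustration.

```python
import numpy as np


def integrate_and_fire(inputs, threshold=1.0):
    """Generic IF neuron bank (illustrative, not the paper's layer).

    inputs: array of shape (T, D), one input current per time step and channel.
    Returns a (T, D) array of 0/1 spikes.
    """
    inputs = np.asarray(inputs, dtype=float)
    membrane = np.zeros(inputs.shape[1])   # membrane potential per channel
    spikes = np.zeros_like(inputs)
    for t in range(inputs.shape[0]):
        membrane += inputs[t]              # integrate the incoming signal
        fired = membrane >= threshold      # fire where the threshold is reached
        spikes[t] = fired
        membrane[fired] = 0.0              # hard reset after a spike (assumed)
    return spikes


def binary_prompt(inputs, threshold=1.0):
    """Hypothetical discretization: 1 if a channel spiked at least once."""
    spikes = integrate_and_fire(inputs, threshold)
    return (spikes.sum(axis=0) > 0).astype(np.int8)


# Channel 0: weak but sustained signal. Channel 1: a single transient burst.
# Channel 2: silent.
x = np.array([
    [0.5, 0.00, 0.0],
    [0.5, 0.75, 0.0],
    [0.5, 0.00, 0.0],
    [0.5, 0.00, 0.0],
])
print(binary_prompt(x))
```

In this toy input the sustained channel accumulates enough potential to cross the threshold, while the one-off burst never does, so only the first channel is set in the resulting binary vector. A trainable version would also need a surrogate gradient for the thresholding step, which this sketch omits.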