Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

📅 2026-05-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the sensitivity of vision-language model prompt tuning to label noise, where mislabeled samples often induce excessively large gradients that disrupt pretrained priors. To mitigate this, the authors propose a conservative prompt tuning strategy that builds on CLIP's near-optimal initialization and repurposes vanishing gradients, normally a training obstacle, as a noise-filtering mechanism. The method applies two Softmax normalizations in sequence, which adaptively creates a saturation region that suppresses gradient updates from high-error samples, all without additional hyperparameters. Experiments on multiple noisy benchmarks show improved robustness over existing methods that rely on complex architectures or manual hyperparameter tuning.
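To make the saturation argument concrete, here is a short derivation sketch. The summary does not give the exact formulation, so this assumes the simplest reading: the two Softmax operations are composed directly on the class logits and followed by cross-entropy. The symbols $z$, $p$, $q$, $y$, and $C$ are notation introduced here for illustration, not taken from the paper.

```latex
% Assumed setup: logits z \in \mathbb{R}^C, (possibly noisy) label y.
p = \operatorname{softmax}(z), \qquad
q = \operatorname{softmax}(p), \qquad
\mathcal{L} = -\log q_y .

% Gradient with respect to the intermediate probabilities p:
\frac{\partial \mathcal{L}}{\partial p_j} = q_j - \mathbb{1}[j = y] .

% Chain rule through the first softmax, whose Jacobian is
% \partial p_j / \partial z_k = p_j (\mathbb{1}[j = k] - p_k):
\frac{\partial \mathcal{L}}{\partial z_k}
  = p_k \Bigl( \bigl(q_k - \mathbb{1}[k = y]\bigr)
      - \sum_{j=1}^{C} p_j \bigl(q_j - \mathbb{1}[j = y]\bigr) \Bigr) .

% For comparison, standard cross-entropy on p gives
\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_k} = p_k - \mathbb{1}[k = y] .
```

Under this assumed form, when $p$ approaches a one-hot vector on some class $c \neq y$ (a confident prediction that disagrees with the label, as is typical for a mislabeled sample), the leading factor $p_k$ vanishes for every $k \neq c$ and the bracketed term vanishes for $k = c$, so the whole gradient tends to zero. The standard cross-entropy gradient in the same situation has norm approaching $\sqrt{2}$, which is the large update the summary describes.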
๐Ÿ“ Abstract
Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence for how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
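The suppression effect described in the abstract can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the "sequential probabilistic normalization" is a second softmax applied directly to the output of the first, followed by cross-entropy, and it works on raw class logits rather than on a CLIP prompt learner. The function names and the toy logits are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ce_grad(z, y):
    """Gradient of -log softmax(z)[y] with respect to the logits z."""
    g = softmax(z)
    g[y] -= 1.0
    return g

def double_softmax_grad(z, y):
    """Gradient of -log softmax(softmax(z))[y] with respect to z.

    Assumed formulation (not confirmed by the source): the second
    softmax is applied to the probabilities from the first.
    """
    p = softmax(z)
    q = softmax(p)
    g = q.copy()
    g[y] -= 1.0                      # dL/dp = q - onehot(y)
    return p * (g - np.dot(p, g))    # chain rule through the first softmax

label = 0
samples = {
    # Model is confident in class 1 but the label says 0 (high-error sample).
    "confident, disagrees with label": np.array([-8.0, 8.0, 0.0]),
    # Model is unsure, so the sample still carries a usable signal.
    "uncertain": np.array([0.0, 0.5, 0.0]),
}

ratios = {}
for name, z in samples.items():
    n_ce = np.linalg.norm(ce_grad(z, label))
    n_ds = np.linalg.norm(double_softmax_grad(z, label))
    ratios[name] = n_ds / n_ce
    print(f"{name}: CE grad norm={n_ce:.4f}, double-softmax grad norm={n_ds:.6f}")
```

In this toy setting the gradient of the confidently mismatched sample shrinks by several orders of magnitude relative to plain cross-entropy, while the uncertain sample keeps a sizable fraction of its gradient. No threshold or temperature is introduced to get this behavior, which is consistent with the hyperparameter-free claim, though the real method may differ in its details.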
Problem

Research questions and friction points this paper is trying to address.

label noise
prompt tuning
vision-language models
gradient suppression
CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt tuning
label noise
gradient suppression
vision-language models
CLIP