AI Summary
This work addresses the limitations of existing Visual Prompt Tuning (VPT) methods, which often suffer from gradient oscillations, premature convergence in shallow layers, and high variance in deeper layers, leading to cross-layer inconsistency, slow convergence, and degraded performance. To overcome these issues, we propose a lightweight, general-purpose enhancement framework that initializes task-aware prompt directions via frequency-domain analysis, employs a globally shared Koopman operator to model the dynamic evolution of prompts for consistent cross-layer updates, and incorporates a regularizer inspired by Lyapunov stability theory to suppress error amplification. Notably, our approach requires no modifications to the backbone network or inference pipeline and is compatible with various VPT variants. Experiments on 25 datasets across multiple downstream tasks demonstrate an average 1.41× convergence speedup and a 1–3% accuracy improvement.
Abstract
Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution ($\mathtt{PAE}$), which strengthens visual prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that $\mathtt{PAE}$ accelerates convergence with an average $1.41\times$ speedup and improves accuracy by 1–3% on 25 datasets across multiple downstream tasks. Beyond performance, $\mathtt{PAE}$ is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.
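The two dynamics-modeling ideas in the abstract can be illustrated with a minimal NumPy sketch. All names and shapes below are hypothetical (this is not the paper's released code): a single operator `K` shared across layers plays the role of the Koopman operator that evolves prompts linearly through depth, and a penalty on the spectral norm of `K` stands in for the Lyapunov-inspired regularizer that discourages error amplification during evolution.

```python
import numpy as np

# Illustrative sketch only; dimensions and names are assumptions.
rng = np.random.default_rng(0)
dim, n_prompts, n_layers = 16, 4, 6

# Shared Koopman-style operator: one global linear map for all layers,
# instead of uncoordinated, layer-specific prompt updates.
K = np.eye(dim) + rng.normal(scale=0.05, size=(dim, dim))

# Initial prompt tokens (in the paper these would be task-aware,
# derived from frequency-domain analysis; here they are random).
p0 = rng.normal(size=(n_prompts, dim))

# Evolve the prompts through the layers with the shared linear map.
prompts = [p0]
for _ in range(n_layers - 1):
    prompts.append(prompts[-1] @ K.T)

# Lyapunov-style stability penalty: if the spectral norm of K exceeds 1,
# perturbations grow as prompts propagate through depth, so we penalize
# the excess (this term would be added to the training loss).
spectral_norm = np.linalg.norm(K, ord=2)
stability_penalty = max(0.0, spectral_norm - 1.0) ** 2
```

In an actual training loop, `K` and `p0` would be learnable parameters and `stability_penalty` a weighted term in the objective; the sketch only shows how a shared linear operator yields consistent cross-layer prompt updates.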