Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models

📅 2025-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses test-time zero-shot generalization of vision-language models under distribution shift, proposing TLLA, an efficient adaptive paradigm that requires neither backpropagation nor parameter updates at test time. Methodologically, TLLA argues from a loss landscape perspective that test-time gradient optimization is redundant, and instead uses the relative position between the training minimum and each test sample's loss landscape to guide adaptation. The framework has two stages: Sharpness-Aware Prompt Tuning (SAPT), which locates a flat minimum during prompt tuning, and Sharpness-based Test Sample Selection (STSS), which keeps augmented test views whose loss landscapes align with that flat minimum. Evaluated on four ImageNet variant datasets, TLLA achieves state-of-the-art zero-shot accuracy, outperforming TPT by an average of 5.32% (ResNet-50) and 6.98% (ViT-B/16), while significantly reducing computational overhead.

📝 Abstract
Test-time adaptation of pre-trained vision-language models has emerged as a technique for tackling distribution shifts at test time. Although existing methods, especially those based on Test-time Prompt Tuning (TPT), have shown promising results, the high computational cost of their parameter optimization presents challenges for scalability and practical application. This paper shows, from a loss landscape perspective, that backpropagation in existing methods is unnecessary. Building on this insight, it proposes a simple yet effective framework called Test-time Loss Landscape Adaptation (TLLA). TLLA leverages the relative position between the training minimum and test loss landscapes to guide the adaptation process, avoiding any update of model parameters at test time. Specifically, it consists of two main stages: in the prompt tuning stage, a Sharpness-Aware Prompt Tuning (SAPT) method is introduced to identify a flat training minimum, setting the foundation for the subsequent test-time adaptation; in the test stage, a Sharpness-based Test Sample Selection (STSS) approach is used to ensure the alignment of flat minima within the training loss landscape and each augmented test sample's loss landscape. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that TLLA achieves state-of-the-art performance while significantly reducing computational overhead. Notably, TLLA surpasses TPT by an average of 5.32% and 6.98% on four ImageNet variant datasets when employing ResNet-50 and ViT-B/16 image encoders, respectively. The code will be available soon.
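The SAPT stage described above applies sharpness-aware optimization to the prompt so that training ends at a flat minimum. The sketch below illustrates the generic sharpness-aware minimization (SAM) step on a toy objective; the `loss` function, dimensions, and hyperparameters are illustrative stand-ins, not the paper's actual prompt-tuning objective, which optimizes CLIP prompt embeddings.

```python
import numpy as np

def loss(p):
    # Toy rippled quadratic standing in for the prompt-tuning loss.
    return float(np.sum(p ** 2) + 0.1 * np.sum(np.sin(10 * p)))

def grad(p, eps=1e-5):
    # Numerical gradient via central finite differences, to keep
    # the sketch dependency-free (no autograd framework needed).
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (loss(p + e) - loss(p - e)) / (2 * eps)
    return g

def sapt_step(p, lr=0.05, rho=0.05):
    # SAM-style step: ascend to the worst-case point within an
    # rho-ball around p, then descend using the gradient taken there.
    # This biases the trajectory toward flat regions of the landscape.
    g = grad(p)
    p_adv = p + rho * g / (np.linalg.norm(g) + 1e-12)
    return p - lr * grad(p_adv)

p = np.full(4, 0.5)          # toy "prompt" vector
for _ in range(100):
    p = sapt_step(p)
print(loss(p) < loss(np.full(4, 0.5)))  # loss decreased
```

The ascent-then-descent structure is what distinguishes sharpness-aware tuning from plain gradient descent: the update direction is evaluated at the locally worst perturbation, so sharp minima (where that perturbation hurts a lot) are avoided.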
Problem

Research questions and friction points this paper is trying to address.

Pre-trained Vision-Language Models
Generalization on Unseen Data
Reducing Computational Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

TLLA
SAPT
STSS
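The STSS component keeps only those augmented test views whose loss landscapes are flat around the tuned prompt, so that the flat training minimum and the test landscape stay aligned. A minimal sketch of that selection idea, assuming a hypothetical `view @ prompt` product as a stand-in for the model's logits (the real pipeline would run the vision-language model):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_loss(logits):
    # Prediction entropy of a single view, a common test-time objective.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def sharpness(prompt, view, radius=0.1, n_probes=8):
    # Estimated sharpness: worst-case loss increase when the prompt is
    # perturbed within a small ball (random probes instead of an exact
    # ascent step, to keep the sketch gradient-free).
    base = entropy_loss(view @ prompt)
    worst = base
    for _ in range(n_probes):
        d = rng.normal(size=prompt.shape)
        d *= radius / np.linalg.norm(d)
        worst = max(worst, entropy_loss(view @ (prompt + d)))
    return worst - base

def select_views(prompt, views, keep=0.5):
    # Keep the flattest fraction of augmented views (lowest sharpness),
    # mirroring the idea of aligning flat minima at test time.
    scores = [sharpness(prompt, v) for v in views]
    k = max(1, int(len(views) * keep))
    order = np.argsort(scores)
    return [views[i] for i in order[:k]]

prompt = rng.normal(size=16)
views = [rng.normal(size=(5, 16)) for _ in range(8)]  # 8 views, 5 classes
selected = select_views(prompt, views)
print(len(selected))  # 4 of 8 views retained
```

Because selection only requires forward evaluations of the loss under small perturbations, no backpropagation or parameter update is needed at test time, which is the source of the computational savings the paper reports.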
Aodi Li
University of Science and Technology of China, Hefei 230026, China
Liansheng Zhuang
University of Science and Technology of China
Computer Vision, Knowledge Graph, Computer Games
Xiao Long
University of Science and Technology of China | Alibaba Group
Knowledge Graph, Large Language Model, Reasoning
Minghong Yao
University of Science and Technology of China, Hefei 230026, China
Shafei Wang
Peng Cheng Laboratory, Shenzhen 518000, China