🤖 AI Summary
Existing textual adapters model class-level textual features deterministically, failing to capture intra-class descriptive diversity and inter-class semantic relationships, which limits the downstream transfer performance of vision-language models (VLMs). To address this, the authors propose the Vertex Random Graph Adapter (VRGAdapter), which constructs a Vertex Random Knowledge Graph (VRKG) to jointly model intra-class textual heterogeneity and inter-class structural dependencies. VRGAdapter employs probabilistic message passing and reparameterized sampling to enable uncertainty-aware textual representation learning, and introduces an Uncertainty-guided Multi-branch Fusion (UMF) mechanism for dynamic ensembling. Extensive experiments across multiple vision-language benchmarks demonstrate that VRGAdapter significantly improves fine-tuning performance while delivering stronger robustness and generalization than deterministic adapters.
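The probabilistic message passing and reparameterized sampling mentioned above can be sketched roughly as follows. This is a minimal illustration only: the Gaussian parameterization of each class node, the convex mixing rule over the graph, and the function name `propagate_and_sample` are all assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate_and_sample(mu, sigma, adj, n_samples=1):
    """Hypothetical sketch: each class node carries a Gaussian (mu, sigma)
    over its textual feature. One round of message passing mixes the
    distribution parameters along graph edges, then the reparameterization
    trick draws differentiable samples z = mu + sigma * eps."""
    # Row-normalize the (weighted) adjacency so messages are convex mixtures.
    norm = adj / adj.sum(axis=1, keepdims=True)
    mu_prop = norm @ mu                       # propagate means over the graph
    var_prop = (norm ** 2) @ (sigma ** 2)     # propagate variances (assuming independent nodes)
    sigma_prop = np.sqrt(var_prop)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu_prop + sigma_prop * eps         # sampled class-text features

# Toy example: 3 classes, 4-dim features, fully connected graph with self-loops.
mu = rng.standard_normal((3, 4))
sigma = 0.1 * np.abs(rng.standard_normal((3, 4)))
adj = np.ones((3, 3))
z = propagate_and_sample(mu, sigma, adj, n_samples=5)
print(z.shape)  # (5, 3, 4)
```

Sampling through the reparameterization (rather than from the distribution directly) is what keeps the adapter trainable end-to-end, since gradients flow through `mu` and `sigma`; setting `sigma` to zero recovers a deterministic graph adapter as a special case.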
📝 Abstract
Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ a deterministic textual feature adapter to refine each category's textual representation. However, due to inherent factors such as differing attributes and contexts, the textual descriptions of each category exhibit significant diversity. This description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks, yet a traditional deterministic adapter cannot adequately capture such varied semantic information. It is also desirable to exploit inter-class relationships in the VLM adapter. To address these issues, we propose to incorporate a random graph model into the VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherently diverse descriptions of each category and the inter-class relationships among different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on the VRKG to learn a context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that VRGAdapter provides a more general adapter solution that encompasses the traditional graph-based adapter as a special case. In addition, to enable more robust performance on downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
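The Uncertainty-guided Multi-branch Fusion scheme described above could look roughly like the following. This is a hedged sketch under assumptions: the inverse-uncertainty weighting rule and the function name `uncertainty_guided_fusion` are illustrative choices, not the paper's specified mechanism.

```python
import numpy as np

def uncertainty_guided_fusion(logits_per_branch, uncertainties):
    """Hypothetical sketch: combine per-branch class logits with weights
    inversely proportional to each branch's predictive uncertainty, so
    that more confident branches dominate the ensemble prediction."""
    u = np.asarray(uncertainties, dtype=float)
    w = 1.0 / (u + 1e-8)                     # low uncertainty -> high weight
    w = w / w.sum()                          # normalize to a convex combination
    stacked = np.stack(logits_per_branch)    # shape: (branches, classes)
    return (w[:, None] * stacked).sum(axis=0)

# Toy example: a confident branch and an uncertain branch over 3 classes.
logits_a = np.array([2.0, 0.5, -1.0])   # branch with low uncertainty
logits_b = np.array([0.1, 0.2, 0.0])    # branch with high uncertainty
fused = uncertainty_guided_fusion([logits_a, logits_b], uncertainties=[0.1, 0.9])
print(fused.argmax())  # 0 -- the low-uncertainty branch drives the decision
```

The design intent is that the ensemble weights are computed dynamically per input rather than fixed, so a branch that is unreliable for a given sample is down-weighted only for that sample.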