Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot image classification under inference constraints: no fine-tuning, no labeled data, and limited computational resources. We propose Visual Sparse Steering (VS2), a training-free, fully unsupervised, lightweight test-time steering method for vision-language models (e.g., CLIP). VS2 employs a sparse autoencoder to extract top-k sparse features from visual embeddings and generates steering vectors that dynamically recalibrate vision–language semantic alignment. We further introduce VS2++—a retrieval-augmented variant—and PASS, a prototype-aligned sparse training strategy, both operating without supervision yet substantially enhancing fine-grained discrimination. Experiments show that VS2 improves zero-shot accuracy over CLIP by 4.12%, 1.08%, and 1.84% on CIFAR-100, CUB-200, and Tiny-ImageNet, respectively; VS2++ achieves up to a 21.44% gain; and PASS further boosts VS2’s performance by 6.12% on CIFAR-100.
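The steering mechanism described above can be sketched in a few lines: encode a visual embedding with a top-k sparse autoencoder, decode the sparse code into a steering vector, blend it back into the embedding, and classify against text embeddings. This is an illustrative NumPy sketch only; the SAE weights (`W_enc`, `b_enc`, `W_dec`), the sparsity level `k`, and the blend coefficient `alpha` are placeholders, not the paper's actual parameters.

```python
import numpy as np

def topk_sae_encode(z, W_enc, b_enc, k=8):
    """Encode an embedding with a top-k sparse autoencoder:
    keep only the k largest ReLU latent activations, zero the rest."""
    h = np.maximum(W_enc @ z + b_enc, 0.0)   # ReLU latents
    idx = np.argsort(h)[-k:]                 # indices of the top-k activations
    sparse = np.zeros_like(h)
    sparse[idx] = h[idx]
    return sparse

def vs2_steer(z, W_enc, b_enc, W_dec, k=8, alpha=0.5):
    """Decode the top-k sparse code into a steering vector, blend it
    into the original embedding, and re-normalize (alpha is a
    hypothetical blending knob, not a value from the paper)."""
    sparse = topk_sae_encode(z, W_enc, b_enc, k)
    steer = W_dec @ sparse                   # decoded sparse reconstruction
    z_new = z + alpha * steer
    return z_new / np.linalg.norm(z_new)

def classify(z, text_embs):
    """Zero-shot CLIP-style classification: cosine similarity against
    per-class text embeddings (rows assumed unit-norm)."""
    return int(np.argmax(text_embs @ z))
```

Because the method is training-free and unsupervised, the only learned components are the SAE weights; at test time each image embedding is steered independently, which is what keeps the approach lightweight.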

📝 Abstract
Steering vision foundation models at inference time without retraining or access to large labeled datasets is a desirable yet challenging objective, particularly in dynamic or resource-constrained settings. In this paper, we introduce Visual Sparse Steering (VS2), a lightweight, test-time method that guides vision models using steering vectors derived from sparse features learned by top-$k$ Sparse Autoencoders without requiring contrastive data. Specifically, VS2 surpasses zero-shot CLIP by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. We further propose VS2++, a retrieval-augmented variant that selectively amplifies relevant sparse features using pseudo-labeled neighbors at inference time. With oracle positive/negative sets, VS2++ achieves absolute top-1 gains over CLIP zero-shot of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Interestingly, VS2 and VS2++ raise per-class accuracy by up to 25% and 38%, respectively, showing that sparse steering benefits specific classes by disambiguating visually or taxonomically proximate categories rather than providing a uniform boost. Finally, to better align the sparse features learned through the SAE reconstruction task with those relevant for downstream performance, we propose Prototype-Aligned Sparse Steering (PASS). By incorporating a prototype-alignment loss during SAE training, using labels only during training while remaining fully test-time unsupervised, PASS consistently, though modestly, outperforms VS2, with its largest gain, 6.12%, on CIFAR-100 with ViT-B/32.
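The PASS idea, a prototype-alignment term added to the SAE reconstruction objective, can be sketched as follows. This is a minimal NumPy sketch under assumptions: class prototypes are taken to be mean class embeddings, the alignment term is written here as a cosine distance, and `lam` is a hypothetical weighting knob; the paper's exact loss formulation may differ.

```python
import numpy as np

def prototype_alignment_loss(z_recon, labels, prototypes):
    """Prototype-alignment term: pull each SAE reconstruction toward
    its class prototype. Labels are used only at SAE-training time;
    inference remains fully unsupervised."""
    protos = prototypes[labels]                               # (B, d) targets
    zn = z_recon / np.linalg.norm(z_recon, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    # mean cosine distance between reconstructions and prototypes
    return float(np.mean(1.0 - np.sum(zn * pn, axis=1)))

def pass_objective(z, z_recon, labels, prototypes, lam=0.1):
    """Illustrative PASS-style objective: SAE reconstruction MSE plus
    the weighted prototype-alignment term (lam is a placeholder)."""
    recon = float(np.mean((z - z_recon) ** 2))
    return recon + lam * prototype_alignment_loss(z_recon, labels, prototypes)
```

The design point is that supervision enters only through this auxiliary training loss, steering the SAE toward class-discriminative sparse features while the test-time procedure stays identical to VS2.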
Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot image classification without retraining
Guiding vision models with sparse autoencoder features
Enhancing accuracy for specific visually ambiguous categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoder-derived steering vectors
Enhances features with retrieval-augmented VS2++
Aligns sparse features via prototype-alignment loss
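The retrieval-augmented reweighting behind VS2++ can be illustrated with a small sketch: sparse latent dimensions that also fire in retrieved pseudo-positive neighbors get amplified before decoding. This is a hypothetical simplification; the `boost` factor and the frequency-based weighting scheme are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def vs2pp_amplify(code, neighbor_codes, boost=2.0):
    """Selectively amplify sparse features supported by retrieved
    pseudo-labeled neighbors: latents that fire in more neighbors get
    weights closer to `boost`; unsupported latents keep weight 1."""
    support = (neighbor_codes > 0).mean(axis=0)   # fraction of neighbors firing
    weights = 1.0 + (boost - 1.0) * support       # more support -> larger weight
    return code * weights
```

With oracle positive/negative sets (as in the abstract's upper-bound experiments), the neighbor codes are reliable and the amplified code sharpens exactly the class-relevant features, which is consistent with the large per-class gains reported.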