Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of vision-language alignment and its reliance on large-scale image-text paired data. It proposes repurposing the classification head weights of supervised pretrained vision models, which are typically discarded after pretraining, as semantic prototypes. Used as semantic anchors, these weights enable zero-shot alignment without additional paired data; mixed with real image-text pairs, they serve as a data augmentation strategy. Integrated with several state-of-the-art post-hoc alignment methods, the approach consistently improves accuracy on cross-modal retrieval, zero-shot classification, and few-shot classification.

📝 Abstract

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.
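
The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of the general idea under stated assumptions: each row of a classification head is treated as a prototype for its class, paired with a text embedding of the class name, and a linear map from the vision space to the text space is fitted by ridge regression. The function names (`fit_linear_map`, `align_with_prototypes`, `zero_shot_predict`), the ridge-regression map, and the array shapes are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length (eps guards against division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fit_linear_map(src, dst, ridge=1e-3):
    """Ridge-regression map M such that src @ M approximates dst.

    Assumption: a single linear map stands in for whatever post-hoc
    alignment module is actually used.
    """
    d = src.shape[1]
    gram = src.T @ src + ridge * np.eye(d)
    return np.linalg.solve(gram, src.T @ dst)


def align_with_prototypes(head_weights, class_text_emb,
                          image_emb=None, text_emb=None):
    """Fit a vision-to-text map from classification head weights.

    head_weights:   (num_classes, d_vision), one row per class.
    class_text_emb: (num_classes, d_text), text embedding of each class name.
    image_emb/text_emb: optional real paired embeddings; when given, the
    prototypes are stacked with them and act as extra training pairs.
    """
    src = l2_normalize(head_weights)
    dst = l2_normalize(class_text_emb)
    if image_emb is not None and text_emb is not None:
        src = np.vstack([src, l2_normalize(image_emb)])
        dst = np.vstack([dst, l2_normalize(text_emb)])
    return fit_linear_map(src, dst)


def zero_shot_predict(image_features, mapping, class_text_emb):
    """Project image features into text space and pick the closest class."""
    projected = l2_normalize(image_features @ mapping)
    scores = projected @ l2_normalize(class_text_emb).T  # cosine similarity
    return scores.argmax(axis=1)
```

Calling `align_with_prototypes(head_weights, class_text_emb)` alone corresponds to the zero-shot setting, where no image-text pairs are used; passing `image_emb` and `text_emb` as well corresponds to the augmentation setting, where prototypes are mixed with real pairs.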

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Alignment

Zero-shot Classification

Cross-modal Retrieval

Weight Recycling

Semantic Prototypes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight Recycling

Semantic Prototypes

Vision-Language Alignment