ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

📅 2026-02-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited discriminability of text prototypes in existing vision-language models (VLMs) for zero-shot classification, which stems from handcrafted prompt engineering. The authors propose a method that leverages only class names to fine-tune the VLM’s text encoder via low-rank adaptation (LoRA) and introduces a novel loss function that jointly enforces inter-class orthogonality and prototype fidelity. This loss admits a probabilistic interpretation as a maximum likelihood estimator grounded in Huygens’ theorem, enabling a plug-and-play, model-agnostic adapter. Evaluated across 11 benchmark datasets and three mainstream VLMs, the approach consistently improves performance in zero-shot, few-shot, and test-time adaptation settings, demonstrating its effectiveness as a general-purpose module for enhancing existing VLM-based methods.

πŸ“ Abstract
Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero-shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task-specific discriminability. We introduce ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low-rank adaptation, a novel loss integrating two terms: one promoting pairwise orthogonality between the textual representations of the classes of a given task, and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens' theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as a plug-and-play module on top of various state-of-the-art methods, and across different prediction settings (zero-shot, few-shot, and test-time adaptation), ORION improves performance consistently and significantly.
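The two-term objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the squared off-diagonal cosine-similarity form of the orthogonality term, the squared-distance fidelity term, and the weight `lam` are all assumptions made for clarity.

```python
import numpy as np

def orion_style_loss(text_emb, init_proto, lam=1.0):
    """Hypothetical sketch of a loss combining inter-class orthogonality
    with fidelity to the initial (frozen) class prototypes.

    text_emb:   (K, d) current class text embeddings (e.g. from a LoRA-tuned encoder)
    init_proto: (K, d) initial prototypes from the frozen text encoder
    lam:        assumed trade-off weight between the two terms
    """
    # L2-normalize embeddings so pairwise dot products are cosine similarities.
    Z = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    G = Z @ Z.T                      # (K, K) cosine-similarity Gram matrix
    K = G.shape[0]
    # Orthogonality term: penalize squared off-diagonal similarities
    # (zero exactly when class embeddings are mutually orthogonal).
    off_diag = G - np.eye(K)
    ortho = np.sum(off_diag ** 2) / (K * (K - 1))
    # Fidelity term: penalize deviation from the initial prototypes.
    fidelity = np.mean(np.sum((text_emb - init_proto) ** 2, axis=1))
    return ortho + lam * fidelity
```

With mutually orthogonal embeddings left at their initial prototypes, both terms vanish and the loss is zero; in practice one would minimize this loss with respect to the LoRA parameters of the text encoder while everything else stays frozen.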
Problem

Research questions and friction points this paper is trying to address.

vision-language models
textual prototypes
zero-shot classification
class discriminability
embedding geometry
Innovation

Methods, ideas, or system contributions that make the work stand out.

orthonormal text encoding
low-rank adaptation
vision-language models
zero-shot classification
maximum likelihood estimation