🤖 AI Summary
This work addresses two critical challenges for vision-language models (VLMs) in real-world deployment: weak generalization under covariate shift and unreliable detection of out-of-distribution (OOD) classes under open-set semantic shift. We propose ΔEnergy, the first OOD score grounded in the change of modality-aligned energy, and theoretically show that it simultaneously improves OOD detection performance and robustness to covariate shift. Methodologically, we formulate a unified energy-based vision-language alignment framework that jointly optimizes model parameters by maximizing a lower bound of ΔEnergy. To enhance alignment stability, we incorporate cosine-similarity constraints and Hessian-consistency regularization. Extensive experiments across multiple benchmarks demonstrate significant AUROC gains of 10–25% over state-of-the-art methods, validating both the effectiveness of our approach and its strong generalization under distribution shift.
📝 Abstract
Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, however, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. OOD data often exhibits both covariate shift (e.g., known classes with changes in image style) and semantic shift (e.g., classes unseen at test time). This highlights the importance of improving VLMs' generalization to covariate-shifted OOD data while effectively detecting open-set, semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning the vision and language modalities (specifically, by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable basis for OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shift, achieved by maximizing a lower bound of ΔEnergy (a method we term EBM). EBM is theoretically proven not only to enhance OOD detection but also to yield a domain-consistent Hessian, which serves as a strong indicator of OOD generalization. Based on this finding, we develop a unified fine-tuning framework that improves VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, which outperforms recent approaches by 10% to 25% in AUROC.
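To make the scoring idea concrete, here is a minimal sketch of how a ΔEnergy-style score could be computed from image-text cosine similarities, based only on the mechanism the abstract describes (suppress the maximum cosine similarity to a low value, then measure the change in the vanilla energy score). The function names, the temperature parameter, and the choice of `low_value` are illustrative assumptions, not the paper's actual implementation.

```python
import math

def energy(logits, T=1.0):
    # Vanilla energy OOD score: E(x) = -T * log(sum_k exp(logit_k / T)),
    # where logit_k is the image-text similarity for class k.
    return -T * math.log(sum(math.exp(l / T) for l in logits))

def delta_energy(cos_sims, low_value=0.0, T=1.0):
    # Hypothetical sketch of the ΔEnergy idea: "re-align" the modalities by
    # forcing the maximum cosine similarity down to `low_value`, then measure
    # how much the energy changes. An ID sample, dominated by one matched
    # class, should shift more than an OOD sample with flat similarities.
    before = energy(cos_sims, T)
    perturbed = list(cos_sims)
    perturbed[perturbed.index(max(perturbed))] = low_value
    return energy(perturbed, T) - before
```

Under this sketch, a peaked similarity profile such as `[0.9, 0.1, 0.1]` (likely ID) yields a larger ΔEnergy than a flat profile such as `[0.35, 0.34, 0.33]` (likely OOD), which is the separation the score exploits; the actual method additionally maximizes a lower bound of this quantity during fine-tuning (EBM).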