Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language foundation models lack effective mechanisms for abstaining from predictions in open-set, unbounded-vocabulary tasks such as image captioning, largely due to their reliance on closed-set assumptions. This work proposes PaPSP, a training-free, plug-and-play selective prediction method, and introduces, for the first time, a memory-augmented variant (MA-PaPSP) that leverages an external retrieval dataset to construct mean embeddings of nearest-neighbor image-text pairs, thereby reducing representation variance. By integrating contrastive normalization to enhance similarity calibration, the approach is compatible with any vision-language model and consistently outperforms existing baselines across diverse tasks including image captioning, image-text matching, and fine-grained classification. The proposed framework effectively addresses the dual challenges of representation instability and insufficient calibration in open-world settings.
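The two ingredients described above, averaging retrieved nearest-neighbor embeddings to reduce variance and normalizing similarity scores against a contrastive set before thresholding, can be sketched as follows. This is a minimal illustration with NumPy, not the paper's implementation: the function names, the choice of k, and the softmax-style normalization over negative captions are all assumptions made for exposition.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere (cosine-similarity convention)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def memory_augment(query_emb, memory_embs, k=5):
    """Reduce embedding variance by averaging the query with its k nearest
    neighbors retrieved from an external memory of unit-norm embeddings."""
    sims = memory_embs @ query_emb          # cosine similarity to each memory entry
    idx = np.argsort(-sims)[:k]             # indices of the k most similar entries
    pooled = np.vstack([query_emb[None], memory_embs[idx]]).mean(axis=0)
    return l2_normalize(pooled)             # re-normalize the mean embedding

def contrastive_score(img_emb, txt_emb, negative_txt_embs):
    """Calibrate a raw image-text similarity by normalizing it against a set
    of negative captions; the result is a confidence in [0, 1] that can be
    compared to a rejection threshold."""
    pos = img_emb @ txt_emb
    negs = negative_txt_embs @ img_emb
    logits = np.concatenate([[pos], negs])
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return probs[0]                         # abstain if this falls below a threshold
```

In use, a candidate caption would be accepted only when `contrastive_score` exceeds a chosen threshold; the raw cosine similarity alone is poorly calibrated, which is exactly the second challenge the summary identifies.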

๐Ÿ“ Abstract
Selective prediction aims to endow predictors with a reject option, to avoid low-confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for vision-language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek low-complexity, training-free approaches applicable to any foundation model, and consider methods based on external vision-language model embeddings, such as CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory-augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs, and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.
Problem

Research questions and friction points this paper is trying to address.

selective prediction
vision-language models
open-set tasks
image captioning
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Prediction
Memory Augmentation
Plug-and-Play
Vision-Language Models
Contrastive Normalization
🔎 Similar Papers
No similar papers found.