SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

📅 2026-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a gap in current multimodal large language models: in out-of-distribution (OOD) scenarios they struggle to meet a user-specified risk level while still answering a large share of inputs. The authors propose SIEVES, a selective prediction mechanism in which the reasoner model produces localized visual evidence alongside its answer, and a selector that is separate from the reasoner learns to score the quality of that evidence and decides whether to answer or abstain. Because the selector needs no access to the reasoner's weights or logits, it transfers to proprietary reasoners such as o3 and Gemini-3-Pro without benchmark- or reasoner-specific training or adaptation. Across five OOD benchmarks, SIEVES improves coverage by up to three times over non-grounding baselines and generalizes across all tested reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro).
📝 Abstract
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
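The thresholding scheme described in the abstract (score each answer, abstain below a threshold, and report coverage at a user-defined risk level) can be illustrated with a minimal sketch. This is not the SIEVES selector or the authors' evaluation code. The function name `coverage_at_risk`, its arguments, and the toy scores are hypothetical, and it assumes risk is measured as the error rate among answered inputs.

```python
from typing import Sequence, Tuple


def coverage_at_risk(
    confidences: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> Tuple[float, float]:
    """Return (coverage, threshold) for the largest set of answers, taken in
    order of descending confidence, whose error rate is at most max_risk.

    Hypothetical helper for illustration only; assumes risk = error rate
    among answered inputs and coverage = share of inputs answered.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        return 0.0, float("inf")

    # Visit examples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_coverage, best_threshold = 0.0, float("inf")
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # Tied scores cannot be separated by a threshold, so only evaluate
        # once every example sharing this score has been counted.
        if k < n and confidences[order[k]] == confidences[i]:
            continue
        if errors / k <= max_risk:
            best_coverage = k / n
            best_threshold = confidences[i]
    return best_coverage, best_threshold


# Toy data: five answers with selector scores and correctness labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
coverage, threshold = coverage_at_risk(scores, labels, max_risk=0.25)
print(coverage, threshold)  # 0.8 0.6
```

In this toy case, answering every input scored 0.6 or higher covers 80% of inputs with one error in four answers, which exactly meets the 25% risk level. A selector whose scores rank correct answers above incorrect ones more reliably reaches higher coverage at the same risk level.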
Problem

Research questions and friction points this paper is trying to address.

selective prediction
out-of-distribution
coverage
multimodal large language models
visual-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Prediction
Visual Evidence Scoring
Out-of-Distribution Generalization
Multimodal Reasoning
Model-Agnostic Selector