🤖 AI Summary
This work addresses a critical challenge in deploying vision-language models in high-stakes scenarios: mitigating high-cost errors caused by model uncertainty and keeping error rates controllable under distribution shift. The authors propose a confidence-threshold-based active abstention mechanism that enables selective prediction in video question answering, dynamically trading off coverage against error rate. Using the Gemini 2.0 Flash model and the NExT-QA dataset, they show that this approach significantly reduces error rates under in-distribution conditions while maintaining predictable error control under distribution shift, thereby offering a reliable deployment pathway for safety-critical applications.
📝 Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping the threshold ε produces smooth risk-coverage tradeoffs, reducing error rates f
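A minimal sketch of the thresholding mechanism described in the abstract, assuming per-question confidence scores and correctness labels are already available; the function and variable names (`risk_coverage_curve`, `confidences`, `correct`) are illustrative, not taken from the paper:

```python
import numpy as np

def risk_coverage_curve(confidences, correct, thresholds):
    """For each threshold eps, answer only when confidence >= eps;
    report coverage (fraction answered) and risk (error rate on answered items)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    points = []
    for eps in thresholds:
        answered = confidences >= eps              # abstain on everything below eps
        coverage = answered.mean()
        if answered.any():
            risk = 1.0 - correct[answered].mean()  # error rate among answered questions
        else:
            risk = 0.0                             # full abstention: no errors incurred
        points.append((eps, coverage, risk))
    return points

# Toy usage: confidences would come from the VLM's self-reported probability
# for its chosen answer on each NExT-QA question (values here are made up).
conf = [0.95, 0.40, 0.80, 0.55, 0.99, 0.30]
corr = [True, False, True, False, True, False]
for eps, cov, risk in risk_coverage_curve(conf, corr, thresholds=[0.0, 0.5, 0.9]):
    print(f"eps={eps:.1f}  coverage={cov:.2f}  risk={risk:.2f}")
```

Sweeping ε over a grid and plotting coverage against risk yields the risk-coverage curve referred to above; raising ε lowers coverage and, when confidence is well calibrated, lowers the error rate on the questions the model still answers.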