When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

πŸ“… 2025-11-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Surgical visual question answering (VQA) faces a critical challenge: reliably quantifying model uncertainty in high-stakes clinical settings, where erroneous answers risk harmful decisions. To address this, the authors propose Question-Aligned Semantic Nearest-Neighbor Entropy (QA-SNNE), the first method to incorporate question semantics into semantic-entropy estimation of model answers, enabling black-box uncertainty quantification. QA-SNNE combines nearest-neighbor analysis in a medical text embedding space with the outputs of large vision-language models (LVLMs). It sharpens sensitivity to ambiguous, ill-defined, and hallucinated responses, supporting automated failure detection and safe referral to human experts. Evaluation across multiple LVLMs shows that QA-SNNE improves AUROC by 15–38% over baselines and remains robust under zero-shot and out-of-template stress conditions. The work establishes an interpretable, plug-and-play uncertainty quantification paradigm for clinically trustworthy deployment of surgical VQA systems.

πŸ“ Abstract
Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question-Aligned Semantic Nearest-Neighbor Entropy (QA-SNNE), a black-box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with their nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain-specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15–38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question-aligned uncertainty estimation can improve both safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
Problem

Research questions and friction points this paper is trying to address.

Estimating uncertainty for safer surgical visual question answering
Detecting model hallucinations and ambiguous responses in medical VQA
Improving failure detection through question-aligned semantic entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

QA-SNNE estimates uncertainty using semantic nearest neighbors
Incorporates question semantics into medical text embeddings
Improves hallucination detection for surgical VQA models
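The idea above can be illustrated with a minimal sketch: sample several answers to the same question, embed each (question, answer) pair jointly, and score uncertainty by how semantically spread out the samples are in embedding space. The hashed bag-of-bigrams embedder, the `temperature` parameter, and the exact kernel-density form below are illustrative assumptions standing in for the paper's medical text encoder and scoring details, not the authors' implementation.

```python
import hashlib
import math

def toy_embed(text, dim=64):
    # Toy stand-in for a medical text encoder: hashed bag of word bigrams,
    # L2-normalized so dot products behave like cosine similarities.
    vec = [0.0] * dim
    words = text.lower().split()
    for i in range(len(words) - 1):
        h = int(hashlib.md5(f"{words[i]} {words[i + 1]}".encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def qa_snne(question, answers, temperature=0.1):
    """Question-aligned semantic nearest-neighbor entropy (sketch).

    Embeds each (question, answer) pair jointly, so the same answer text
    can score differently under different questions. Uncertainty is the
    negative mean log kernel density of each sampled answer over all
    samples: near zero when answers agree, large when they diverge.
    """
    embs = [toy_embed(f"question: {question} answer: {a}") for a in answers]
    n = len(embs)
    score = 0.0
    for i in range(n):
        # Similarity-kernel density of answer i over all sampled answers.
        dens = sum(
            math.exp((sum(x * y for x, y in zip(embs[i], embs[j])) - 1.0) / temperature)
            for j in range(n)
        ) / n
        score += -math.log(dens)
    return score / n
```

A high score would flag the question for referral rather than automatic answering; in practice the sampled answers would come from the LVLM's stochastic decoding and the embedder would be a domain-tuned text encoder.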
πŸ”Ž Similar Papers
No similar papers found.
Dennis Pierantozzi
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Luca Carlini
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Mauro Orazio Drago
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Chiara Lena
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Cesare Hassan
IRCCS Humanitas Research Hospital, Italy.
Elena De Momi
Politecnico di Milano
medical robotics, computer vision, artificial intelligence, human robot interaction
Danail Stoyanov
Professor of Robot Vision, University College London
Surgical Vision, Surgical AI, Surgical Robotics, Computer Assisted Interventions, Surgical Data Science
Sophia Bano
Assistant Professor in Robotics and AI, University College London
Computer Vision, Surgical Data Science, Surgical Robotics, Computer-assisted Intervention, Medical Imaging
Mobarak I. Hoque
UCL Hawkes Institute and Department of Computer Science, University College London, UK; Division of Informatics, Imaging and Data Science, University of Manchester, UK.