When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

πŸ“… 2025-11-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Surgical visual question answering (VQA) faces a critical challenge: reliably quantifying model uncertainty in high-stakes clinical settings, where erroneous answers risk harmful decisions. To address this, the authors propose Question-Aligned Semantic Nearest-Neighbor Entropy (QA-SNNE), the first method to incorporate question semantics into semantic-entropy estimation of model answers, enabling black-box uncertainty quantification. QA-SNNE combines nearest-neighbor analysis in a medical text embedding space with the outputs of large vision-language models (LVLMs). It sharpens sensitivity to ambiguous, ill-defined, and hallucinated responses, supporting automated failure detection and safe referral to human experts. Evaluation across multiple LVLMs shows that QA-SNNE improves AUROC by 15–38% over baselines and remains robust under zero-shot and out-of-template stress conditions. The work establishes an interpretable, plug-and-play uncertainty quantification paradigm for clinically trustworthy deployment of surgical VQA systems.

πŸ“ Abstract
Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question-Aligned Semantic Nearest-Neighbor Entropy (QA-SNNE), a black-box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with their nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain-specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15–38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question-aligned uncertainty estimation can improve both safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
Problem

Research questions and friction points this paper is trying to address.

Estimating uncertainty for safer surgical visual question answering
Detecting model hallucinations and ambiguous responses in medical VQA
Improving failure detection through question-aligned semantic entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

QA-SNNE estimates uncertainty using semantic nearest neighbors
Incorporates question semantics into medical text embeddings
Improves hallucination detection for surgical VQA models
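The idea above can be illustrated with a minimal sketch: sample several answers to the same question, embed each (question, answer) pair jointly, and score uncertainty by how semantically spread out the samples are in embedding space. The hashed bag-of-bigrams embedder, the `temperature` parameter, and the exact kernel-density form below are illustrative assumptions standing in for the paper's medical text encoder and scoring details, not the authors' implementation.

```python
import hashlib
import math

def toy_embed(text, dim=64):
    # Toy stand-in for a medical text encoder: hashed bag of word bigrams,
    # L2-normalized so dot products behave like cosine similarities.
    vec = [0.0] * dim
    words = text.lower().split()
    for i in range(len(words) - 1):
        h = int(hashlib.md5(f"{words[i]} {words[i + 1]}".encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def qa_snne(question, answers, temperature=0.1):
    """Question-aligned semantic nearest-neighbor entropy (sketch).

    Embeds each (question, answer) pair jointly, so the same answer text
    can score differently under different questions. Uncertainty is the
    negative mean log kernel density of each sampled answer over all
    samples: near zero when answers agree, large when they diverge.
    """
    embs = [toy_embed(f"question: {question} answer: {a}") for a in answers]
    n = len(embs)
    score = 0.0
    for i in range(n):
        # Similarity-kernel density of answer i over all sampled answers.
        dens = sum(
            math.exp((sum(x * y for x, y in zip(embs[i], embs[j])) - 1.0) / temperature)
            for j in range(n)
        ) / n
        score += -math.log(dens)
    return score / n
```

A high score would flag the question for referral rather than automatic answering; in practice the sampled answers would come from the LVLM's stochastic decoding and the embedder would be a domain-tuned text encoder.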
πŸ”Ž Similar Papers
No similar papers found.
Dennis Pierantozzi
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Luca Carlini
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Mauro Orazio Drago
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Chiara Lena
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Italy.
Cesare Hassan
IRCCS Humanitas Research Hospital, Italy.
Elena De Momi
Politecnico di Milano
medical robotics, computer vision, artificial intelligence, human robot interaction
Danail Stoyanov
Professor of Robot Vision, University College London
Surgical Vision, Surgical AI, Surgical Robotics, Computer Assisted Interventions, Surgical Data Science
Sophia Bano
Assistant Professor in Robotics and AI, University College London
Computer Vision, Surgical Data Science, Surgical Robotics, Computer-assisted Intervention, Medical Imaging
Mobarak I. Hoque
UCL Hawkes Institute and Department of Computer Science, University College London, UK; Division of Informatics, Imaging and Data Science, University of Manchester, UK.