A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding

📅 2025-12-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In medical image understanding, vision-language models (VLMs) struggle to effectively integrate spatially localized visual features, resulting in limited interpretability and clinical credibility of tool invocation. To address this, we propose a Tool Bottleneck Model (TBM)-driven multi-tool collaborative reasoning framework. Our method replaces conventional textual tool composition with learnable neural modules, enabling end-to-end mapping from visual features to tool-specific outputs. The framework dynamically selects and fuses domain-specialized medical tools—such as segmentation, quantitative analysis, and diagnostic modules—tailored to arbitrary VLM backbones, thereby enhancing clinical relevance and decision interpretability. Evaluated on histopathology and dermatology benchmarks, our approach matches or surpasses state-of-the-art methods. Notably, under few-shot settings, it achieves an 8.2% absolute accuracy gain and improves physician-understandability by 37% in clinical expert assessments.

Technology Category

Application Category

📝 Abstract
Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
Problem

Research questions and friction points this paper is trying to address.

Develops a framework for medical image analysis using specialized tools.
Addresses limitations of text-based tool composition in medical imaging.
Enhances interpretability and clinical relevance in medical image predictions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned Tool Bottleneck Model fuses tool outputs
Replaces text-based composition with neural network fusion
Enables interpretable, clinically-grounded medical image predictions
🔎 Similar Papers
No similar papers found.
C
Christina Liu
California Institute of Technology, Pasadena, CA 91125
A
Alan Q. Wang
Stanford University, Stanford, CA 94305
Joy Hsu
Joy Hsu
Stanford University
Artificial IntelligenceMachine LearningComputer Vision
J
Jiajun Wu
Stanford University, Stanford, CA 94305
Ehsan Adeli
Ehsan Adeli
Stanford University
Computer VisionComputational NeurosciencePrecision HealthcareAmbient Intelligence