AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately predicting instrument-tissue interaction sites in surgical automation, and the lack of explicit modeling of safe operational regions for specific tool-action pairs. The authors propose a multimodal framework that integrates multi-view temporal visual encoding, language-conditioned guidance, and a DiT-style decoder to generate dense heatmap predictions of tool-action-specific operable regions during cholecystectomy. They introduce the first surgical operability benchmark dataset, comprising 15,638 annotated clips across six tool-action categories. Their task-customized architecture substantially outperforms general-purpose vision-language models, reducing localization error (ASSD) to 20.6 pixels versus 60.2 pixels for the Molmo-VLM baseline, thereby improving both spatial safety and interpretability in surgical automation.
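The localization error quoted above is ASSD in pixels. Assuming this is the standard average symmetric surface distance between the boundaries of the predicted and ground-truth regions (a sketch of the metric's definition, not the authors' evaluation code), a minimal reference computation looks like:

```python
import math

def boundary(mask):
    """Boundary pixels of a binary mask: foreground pixels with at
    least one 4-neighbour that is background or off the grid."""
    h, w = len(mask), len(mask[0])
    pts = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            nbrs = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(ny < 0 or ny >= h or nx < 0 or nx >= w or not mask[ny][nx]
                   for ny, nx in nbrs):
                pts.append((y, x))
    return pts

def assd(mask_a, mask_b):
    """Average symmetric surface distance (in pixels) between two
    binary masks: mean nearest-boundary distance, taken in both
    directions and averaged over all boundary pixels."""
    sa, sb = boundary(mask_a), boundary(mask_b)
    nearest = lambda p, pts: min(math.dist(p, q) for q in pts)
    total = sum(nearest(p, sb) for p in sa) + sum(nearest(q, sa) for q in sb)
    return total / (len(sa) + len(sb))
```

Under this definition, identical masks score 0, and a lower ASSD means the predicted operable region's boundary hugs the annotated one more closely.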
📝 Abstract
Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these approaches have demonstrated success in tabletop experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action-specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially enabling explicit policy guidance toward appropriate tissue regions and an early safe stop when instruments deviate outside predicted safe zones.
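The abstract's closing claim, that a dense affordance heatmap can drive policy guidance and an early safe stop, can be illustrated with a toy consumer of the predicted heatmap. The function names, the threshold value, and the argmax point estimate below are hypothetical illustrations, not details from the paper:

```python
def peak(heatmap):
    """Point estimate of the predicted interaction site: the (row, col)
    of the maximum activation in a 2-D affordance heatmap."""
    best = max((v, r, c)
               for r, row in enumerate(heatmap)
               for c, v in enumerate(row))
    return best[1], best[2]

def safe_stop(heatmap, tip, threshold=0.5):
    """Early-stop check: True if the instrument tip (row, col) lies
    outside the predicted safe zone, i.e. the heatmap value at the
    tip falls below a chosen affordance threshold."""
    r, c = tip
    return heatmap[r][c] < threshold
```

A downstream policy could steer toward `peak(heatmap)` while a supervisor halts motion whenever `safe_stop` fires; in practice the threshold would need calibration against annotated safe regions.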
Problem

Research questions and friction points this paper is trying to address.

surgical automation
tissue affordance
tool-action specificity
dense prediction
safe interaction regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense affordance prediction
tool-action specificity
multimodal surgical framework
DiT-style decoder
surgical automation