Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address overfitting in few-shot 2D keypoint estimation of surgical instruments from medical images, this paper proposes a lightweight method that combines vision-language models (VLMs) with low-rank adaptation (LoRA). The approach models the mapping between keypoint coordinates and their semantic descriptions through prompt design and vision-text feature alignment. LoRA provides parameter-efficient fine-tuning of a pre-trained VLM, which mitigates overfitting when training data is scarce. Experiments show that the method surpasses CNN- and Transformer-based baselines after only two training epochs. It improves keypoint detection accuracy under low-resource conditions (e.g., fewer than 100 annotated samples) while maintaining interpretability and scalability. The work also lays groundwork for downstream 3D surgical pose estimation by connecting semantic understanding with geometric precision in minimal-data regimes.
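The low-rank idea behind LoRA can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the names `down`, `up`, `lora_forward`, and `trainable_params`, the layer sizes, and the `alpha` value are all assumptions chosen for illustration. The frozen weight `W` is left untouched and only two small matrices of rank `r` would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, down, up, alpha):
    """Compute x @ W + (alpha / r) * (x @ down) @ up.

    W    : frozen pre-trained weight, shape (d_in, d_out)
    down : trainable factor, shape (d_in, r), randomly initialised
    up   : trainable factor, shape (r, d_out), zero-initialised
    """
    r = len(up)  # rank of the update
    base = matmul(x, W)
    update = matmul(matmul(x, down), up)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Number of trainable values in the two low-rank factors."""
    return r * (d_in + d_out)


# Hypothetical toy sizes, chosen only to keep the example fast.
rng = random.Random(0)
d_in, d_out, r = 8, 6, 2
W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
down = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(r)]  # zero init: the update starts as a no-op
x = [[rng.gauss(0, 1) for _ in range(d_in)]]

# Before any training step, the adapted layer reproduces the frozen layer.
starts_as_identity = lora_forward(x, W, down, up, alpha=4) == matmul(x, W)

# For an assumed 4096 x 4096 projection with rank 8, the trainable share is small.
full = 4096 * 4096
low_rank = trainable_params(4096, 4096, 8)
fraction = low_rank / full
```

Because the `up` factor starts at zero, fine-tuning begins from the pre-trained model's behaviour, and the small number of trainable values is one plausible reason such adapters are less prone to overfitting on small datasets.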

📝 Abstract

This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting on small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work on 3D pose estimation of surgical hands and tools.
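The abstract does not give the prompt format, so the following is only a hypothetical sketch of how one instruction-tuning sample could pair an image with a textual keypoint target. The keypoint names, the JSON answer format, the 1000-bin coordinate quantization, and the function names are all assumptions, not details from the paper.

```python
import json


def quantize(value, size, bins):
    """Map a pixel coordinate in [0, size) to an integer bin in [0, bins - 1]."""
    idx = int(value / size * bins)
    return min(max(idx, 0), bins - 1)


def build_sample(image_path, tool, keypoints, width, height, bins=1000):
    """Build one (image, instruction, response) record.

    keypoints maps a semantic name (assumed, e.g. "tip") to pixel (x, y).
    """
    names = ", ".join(keypoints)
    instruction = (
        f"Locate the following keypoints of the {tool} in the image: {names}. "
        f"Answer as JSON mapping each name to [x, y] integers in [0, {bins - 1}]."
    )
    target = {
        name: [quantize(x, width, bins), quantize(y, height, bins)]
        for name, (x, y) in keypoints.items()
    }
    return {
        "image": image_path,
        "instruction": instruction,
        "response": json.dumps(target),
    }


def parse_response(text, width, height, bins=1000):
    """Convert a JSON answer back to pixel coordinates (bin centres)."""
    data = json.loads(text)
    return {
        name: ((qx + 0.5) / bins * width, (qy + 0.5) / bins * height)
        for name, (qx, qy) in data.items()
    }


# Hypothetical annotation for a 640 x 480 frame.
sample = build_sample(
    "frame_0001.png",
    "needle driver",
    {"tip": (320.0, 240.0), "hinge": (400.0, 300.0)},
    width=640,
    height=480,
)
decoded = parse_response(sample["response"], width=640, height=480)
```

Serializing coordinates as text lets the same language-modeling loss used for instruction tuning supervise keypoint locations, at the cost of a quantization error of at most one bin per axis.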

Problem

Research questions and friction points this paper is trying to address.

Estimating 2D keypoints for surgical tools using vision-language models

Overcoming overfitting in small-scale medical datasets with LoRA fine-tuning

Aligning visual features with semantic keypoint descriptions through instruction-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging Vision-Language Models with LoRA fine-tuning

Aligning visual features with semantic keypoint descriptions

Outperforming baselines with minimal fine-tuning epochs

🔎 Similar Papers

Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures