🤖 AI Summary
In gastrointestinal endoscopic surgery, autonomous tracking of abnormal regions and faithful following of circumferential cutting markers remain challenging; conventional model-based pipelines rely heavily on manual hyperparameter tuning and lack high-level intent understanding. Method: We propose EndoVLA, the first Vision-Language-Action (VLA) joint modeling framework tailored for continuum-robot endoscopes. It employs a two-stage fine-tuning paradigm (supervised fine-tuning followed by task-aware reinforcement fine-tuning), integrates a multimodal fusion encoder with an end-to-end action decoder, incorporates a task-specific reward mechanism, and introduces EndoVLA-Motion, the first benchmark dataset for endoscopic motion understanding. Contribution/Results: As the first VLA model adapted to endoscopic robotics, EndoVLA enables zero-shot generalization across anatomical structures and surgical tasks, significantly improving tracking robustness and precision while substantially reducing surgeon cognitive load.
📝 Abstract
In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile, as each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
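The dual-phase strategy described above can be sketched in miniature. Everything here is an illustrative assumption, not the paper's implementation: a linear "policy" stands in for the VLA model, mapping fused vision-language features to a 2-D tip action; phase 1 regresses onto expert demonstration actions (supervised fine-tuning), and phase 2 applies a REINFORCE-style update driven by a task-aware reward, taken here as negative tracking error (a surrogate for "keep the lesion centered").

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 8, 2                        # feature dim, action dim (illustrative)
W = rng.normal(scale=0.1, size=(A, D))   # toy stand-in for the VLA policy

# Phase 1: supervised fine-tuning on (feature, expert action) pairs
feats = rng.normal(size=(64, D))
true_W = rng.normal(size=(A, D))
acts = feats @ true_W.T            # synthetic expert demonstrations
for _ in range(200):
    pred = feats @ W.T
    grad = (pred - acts).T @ feats / len(feats)   # MSE gradient
    W -= 0.1 * grad
sft_err = float(np.mean((feats @ W.T - acts) ** 2))

# Phase 2: reinforcement fine-tuning with a task-aware reward
def reward(action, target):
    # Higher (less negative) when the action tracks the target
    return -np.sum((action - target) ** 2)

for _ in range(100):
    x = rng.normal(size=D)
    target = x @ true_W.T          # "keep lesion centered" surrogate
    noise = rng.normal(scale=0.05, size=(A,))
    action = x @ W.T + noise       # exploratory perturbation
    # REINFORCE-style update: weight the perturbation by its reward
    W += 0.5 * reward(action, target) * np.outer(noise, x)

final_err = float(np.mean((feats @ W.T - acts) ** 2))
```

The point of the sketch is the ordering: reward-driven updates start from a policy already anchored by supervised fine-tuning, which is how the dual-phase design copes with data scarcity, rather than learning from reward alone.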