BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of six-degree-of-freedom endoscopic localization in real-world clinical settings—namely, data scarcity, the difficulty of fine-grained pose regression, and high computational latency in temporal modeling—by introducing BREATH, the largest in vivo bronchoscopic localization dataset to date, and the BREATH-VL framework. BREATH-VL uniquely integrates the semantic understanding of vision-language models with geometric constraints from visual registration, and incorporates a lightweight temporal context learning mechanism that encodes motion history into language prompts for efficient temporal inference. Experimental results demonstrate that the proposed method significantly outperforms existing purely visual approaches while maintaining low computational latency, achieving a 25.5% reduction in translation error and exhibiting superior accuracy and generalization capability.
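The fusion described above combines a coarse pose cue from the vision-language branch with a precise estimate from visual registration. As a minimal sketch of this idea (the function name, weighting rule, and fixed weight are illustrative assumptions; the paper's actual fusion module is learned, not a hand-set average):

```python
import numpy as np

def fuse_translations(t_semantic, t_geometric, w_geo=0.7):
    """Blend two 3-D translation estimates for the endoscope tip.

    t_semantic:  translation predicted by the VLM (semantic) branch, mm
    t_geometric: translation from vision-based registration, mm
    w_geo:       trust placed in the geometric branch (hypothetical value)

    A simple convex combination: the geometric branch dominates when
    registration is reliable, while the semantic branch anchors the
    estimate in ambiguous airway scenes.
    """
    t_semantic = np.asarray(t_semantic, dtype=float)
    t_geometric = np.asarray(t_geometric, dtype=float)
    return w_geo * t_geometric + (1.0 - w_geo) * t_semantic
```

In practice the relative weight would itself be predicted per frame (e.g., from registration confidence) rather than fixed.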

📝 Abstract
Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
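The abstract's lightweight context-learning mechanism encodes motion history as linguistic prompts instead of processing past video frames. A minimal sketch of what such an encoding could look like (the function, prompt wording, and thresholds are illustrative assumptions; the paper's actual prompt template is not specified here):

```python
import math

def motion_history_prompt(pose_deltas, max_steps=3):
    """Summarize recent camera motion as a short text prompt for a VLM.

    pose_deltas: list of (dx, dy, dz, yaw_deg) relative-pose tuples,
                 oldest first, e.g. from the last few localization steps.
    Only the most recent `max_steps` deltas are verbalized, keeping the
    prompt short so temporal context adds negligible inference cost.
    """
    steps = []
    for dx, dy, dz, yaw in pose_deltas[-max_steps:]:
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        # 5-degree dead zone is an assumed threshold for "straight" motion
        if yaw > 5:
            turn = "turning left"
        elif yaw < -5:
            turn = "turning right"
        else:
            turn = "going straight"
        steps.append(f"advanced {dist:.1f} mm while {turn}")
    return "Recent motion: " + "; ".join(steps) + "."
```

Feeding this sentence alongside the current frame lets the language backbone reason over motion history without video-level feature extraction, which is the latency saving the abstract claims.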
Problem

Research questions and friction points this paper is trying to address.

vision-language models
6-DoF localization
endoscopic navigation
medical imaging
temporal feature extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model
6-DoF localization
semantic-geometric fusion
endoscopic navigation
temporal context learning
Qingyao Tian
Ph.D. candidate, Institute of Automation, Chinese Academy of Sciences. Interests: AI for healthcare, medical imaging, foundation models.
Bingyu Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China.
Huai Liao
Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong Province, P.R. China.
Xinyan Huang
Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong Province, P.R. China.
Junyong Li
Centre of AI and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences.
Dong Yi
Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences. Interests: Computer Vision, Pattern Recognition.
Hongbin Liu
Chinese Academy of Sciences; King's College London. Interests: AI and medical robotics, embodied AI, MLLM.