🤖 AI Summary
This study addresses key challenges in clinical free-text-guided segmentation of pulmonary medical images, including semantic ambiguity, anatomical structure overlap, and overfitting of large models under limited training data. To tackle these issues, the authors propose a novel framework that integrates a large language model (LLaMA-3-V) with a vision foundation model (MedSAM). The approach leverages text-to-vision intent distillation to extract diagnostic guidance and formulates lesion mask selection as a dynamic semantic-topological graph reasoning problem. A selective asymmetric fine-tuning strategy is introduced, updating fewer than 1% of model parameters. Evaluated on the LIDC-IDRI dataset, the method achieves a Dice coefficient of 81.5%, outperforming state-of-the-art approaches such as LISA by over 5%, while exhibiting high stability with a five-fold cross-validation variance of only 0.6%.
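The "fewer than 1% of model parameters" claim can be made concrete with a small sketch of how such a selective, asymmetric freeze is typically budgeted. The module names and parameter counts below are hypothetical illustrations, not the paper's actual architecture; the point is only the bookkeeping that verifies the trainable fraction stays under the stated budget.

```python
# Sketch of a SAFT-style selective parameter freeze.
# Module names and sizes are hypothetical; the summary only states that
# fewer than 1% of parameters are updated, not which layers they live in.

# Hypothetical parameter counts per module (name -> number of parameters).
model_params = {
    "llm.backbone":          7_000_000_000,  # frozen LLM (LLaMA-3-V class)
    "vision.encoder":           90_000_000,  # frozen vision foundation model
    "vision.mask_decoder":       4_000_000,  # frozen
    "bridge.intent_adapter":    20_000_000,  # trainable lightweight adapter
    "bridge.graph_head":         8_000_000,  # trainable reasoning head
}

# Asymmetric selection: only the small cross-modal bridge is trainable,
# while both large backbones stay frozen.
trainable_prefixes = ("bridge.",)

def trainable_ratio(params, prefixes):
    """Fraction of parameters that would receive gradient updates."""
    total = sum(params.values())
    trainable = sum(n for name, n in params.items()
                    if name.startswith(prefixes))
    return trainable / total

ratio = trainable_ratio(model_params, trainable_prefixes)
print(f"trainable fraction: {ratio:.4%}")  # well under the 1% budget
```

Freezing the backbones in this way is also what the summary credits for stability: with so few free parameters, the model has far less capacity to overfit a small medical dataset.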
📄 Abstract
Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation capabilities of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.
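The "candidate lesions as nodes, edges as spatial and semantic affinities" formulation can be sketched as a toy scoring problem. Everything below is illustrative: the candidate coordinates, two-dimensional "embeddings", edge-weight formula, and support-propagation rule are assumptions standing in for the paper's actual node/edge features and reasoning mechanism. The sketch only shows why graph structure helps: a candidate that overlaps the target spatially but differs semantically is pulled apart from it.

```python
import math

# Toy mask-selection-as-graph-reasoning sketch. All numbers and the scoring
# rule are hypothetical illustrations, not the STGR framework's actual design.

# Candidate lesion masks: name -> (centroid in normalized coords, feature vector).
candidates = {
    "nodule_A": ((0.30, 0.40), [0.9, 0.1]),
    "nodule_B": ((0.32, 0.42), [0.2, 0.8]),  # spatially entangled with A
    "vessel_C": ((0.80, 0.10), [0.1, 0.9]),
}

# Distilled text-intent embedding (e.g., from an instruction such as
# "segment the spiculated nodule in the upper left lobe").
intent = [1.0, 0.0]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def edge_weight(p, q, fu, fv):
    """Edge affinity combining spatial proximity and semantic similarity."""
    return math.exp(-math.dist(p, q)) * cosine(fu, fv)

def node_score(name):
    """Direct intent match plus affinity-weighted support from neighbors."""
    p, f = candidates[name]
    direct = cosine(f, intent)
    support = sum(edge_weight(p, q, f, g) * cosine(g, intent)
                  for other, (q, g) in candidates.items() if other != name)
    return direct + 0.5 * support

best = max(candidates, key=node_score)
print("selected mask:", best)
```

Here the semantically matching candidate wins even though a second mask sits almost on top of it, which is the disambiguation behavior the graph formulation is meant to provide for overlapping anatomy.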