🤖 AI Summary
Medical vision-language segmentation models (VLSMs) incur high computational costs during fine-tuning, and existing parameter-efficient fine-tuning (PEFT) methods use rigid, uniform adapter dimensions that ignore the varying semantic demands across Transformer layers. To address this, we propose a depth-aware progressive scaling adapter architecture, the "telescopic adapter": to our knowledge, the first adapter design whose dimensionality expands according to layer depth and semantic importance in the Transformer. Our experiments support the hypothesis that deeper layers require greater adaptation capacity. Integrated into both the vision and text encoders of CLIPSeg, our lightweight bottleneck modules follow a depth-aware linear scaling strategy that enables variable-dimensional adaptation. With only 613K trainable parameters, 244× fewer than full fine-tuning, our method achieves state-of-the-art segmentation performance across five cross-modal medical datasets (e.g., colon polyps, skin lesions, breast ultrasound), significantly enhancing deployability in resource-constrained clinical settings.
📝 Abstract
Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613K trainable parameters (244× fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
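The depth-aware linear scaling idea described above can be illustrated with a short sketch. This is not the authors' released code: the layer count (12, typical of CLIP-style encoders), the hidden size (768), the minimum and maximum bottleneck widths, and the helper names `telescopic_dims` and `adapter_params` are all assumptions made for the example. It shows how a bottleneck width that grows linearly with depth yields a per-layer dimension schedule and a small total parameter budget.

```python
def telescopic_dims(num_layers: int, d_min: int, d_max: int) -> list[int]:
    """Linearly scale the adapter bottleneck width from d_min at the
    shallowest layer to d_max at the deepest layer (hypothetical values)."""
    if num_layers == 1:
        return [d_max]
    step = (d_max - d_min) / (num_layers - 1)
    return [round(d_min + i * step) for i in range(num_layers)]

def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameter count of one bottleneck adapter: a down-projection and an
    up-projection with biases (layer norms and activations add no weights here)."""
    down = d_model * bottleneck + bottleneck   # W_down + b_down
    up = bottleneck * d_model + d_model        # W_up + b_up
    return down + up

# Example: a 12-layer encoder whose adapter widths grow from 8 to 64.
dims = telescopic_dims(num_layers=12, d_min=8, d_max=64)
total = sum(adapter_params(d_model=768, bottleneck=b) for b in dims)
print("per-layer widths:", dims)
print("total adapter parameters:", total)
```

Under this kind of schedule, most of the parameter budget naturally concentrates in the deeper layers, which is exactly the allocation the ablation studies above argue for; the actual widths and the resulting 613K total in the paper would depend on its specific configuration.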