🤖 AI Summary
This work addresses the challenge of deploying large-scale foundation models for object segmentation in optical remote sensing imagery, where full-parameter fine-tuning incurs prohibitive memory and computational costs. To this end, we propose WEFT, an efficient fine-tuning approach guided by dynamic wavelet experts. WEFT introduces, for the first time in remote sensing segmentation, a learnable wavelet expert extractor coupled with a conditional adapter, which enhances the fine-grained perceptual capabilities of a frozen foundation model while tuning only a small fraction of the parameters. By integrating wavelet transforms, dynamic expert mechanisms, and parameter-efficient fine-tuning, WEFT outperforms 21 state-of-the-art methods across three remote sensing benchmarks and achieves superior performance on camouflaged, natural, and medical image segmentation tasks, all while significantly reducing training resource consumption.
📝 Abstract
Accurately localizing and segmenting relevant objects in optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. Deeper, larger-scale foundation models could provide stronger support for performance improvement. However, because of their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, so large-scale models remain largely unexplored in existing works. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSI segmentation by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor that models wavelet experts from different perspectives and dynamically regulates their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for the target task by injecting the trainable features, and then iteratively updates both types of features, enabling efficient fine-tuning. Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art (SOTA) methods on three ORSI datasets, but also achieves top results in camouflaged, natural, and medical image segmentation scenarios. The source code is available at: https://github.com/CSYSI/WEFT.
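To make the core idea concrete, here is a minimal, hypothetical sketch of the mechanism the abstract describes, not the authors' implementation: a single-level 2D Haar wavelet transform splits a feature map into four subband "experts" (LL, LH, HL, HH), a softmax gate dynamically weights them into a trainable feature, and that feature is injected residually into a frozen backbone feature, adapter-style. All function names, shapes, and the mixing weight `alpha` are assumptions for illustration.

```python
import math

def haar_experts(x):
    """Single-level 2D Haar decomposition of a 2D list x (H x W, even dims).
    Returns four subbands [LL, LH, HL, HH], each of size (H//2) x (W//2)."""
    h, w = len(x), len(x[0])
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        rll, rlh, rhl, rhh = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rll.append((a + b + c + d) / 2.0)  # low-frequency approximation
            rlh.append((a - b + c - d) / 2.0)  # horizontal detail
            rhl.append((a + b - c - d) / 2.0)  # vertical detail
            rhh.append((a - b - c + d) / 2.0)  # diagonal detail
        ll.append(rll); lh.append(rlh); hl.append(rhl); hh.append(rhh)
    return [ll, lh, hl, hh]

def dynamic_expert_feature(x, gate_logits):
    """Softmax-gate the four wavelet-subband 'experts' into one
    trainable feature map (gate_logits would be learned in practice)."""
    experts = haar_experts(x)
    m = max(gate_logits)
    w = [math.exp(g - m) for g in gate_logits]
    s = sum(w)
    w = [v / s for v in w]
    h2, w2 = len(experts[0]), len(experts[0][0])
    return [[sum(w[k] * experts[k][i][j] for k in range(4))
             for j in range(w2)] for i in range(h2)]

def adapter_inject(frozen, trainable, alpha=0.1):
    """Residual injection of the trainable feature into a frozen feature;
    only the gate (and alpha) would be updated during fine-tuning."""
    return [[f + alpha * t for f, t in zip(fr, tr)]
            for fr, tr in zip(frozen, trainable)]

x = [[1.0] * 4 for _ in range(4)]        # stand-in input feature map
gate = [0.0, 0.0, 0.0, 0.0]              # uniform gating logits
frozen = [[0.0] * 2 for _ in range(2)]   # stand-in frozen backbone feature
fused = adapter_inject(frozen, dynamic_expert_feature(x, gate))
print(fused)  # each entry: 0.1 * (0.25 * 2.0) = 0.05
```

In a real system the subband experts would be processed by small learnable branches and the gate conditioned on the input, but the shape of the computation, decompose, gate, inject, follows the paradigm the abstract outlines.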