AI Summary
Large-scale foundation models incur prohibitive computational costs when fully fine-tuned for object segmentation, while existing prompt-based tuning methods lack sufficient semantic priors. To address this, we propose an efficient frozen-tuning framework grounded in dynamic local priors. Our method introduces a dynamic hybrid local prior extractor and a bidirectional interaction adapter, integrating heterogeneous convolutions, gated routing, and cosine-aligned deformable attention to adaptively generate semantically rich local priors. In addition, a dynamic Mixture-of-Experts (MoE) mechanism modulates the frozen model's local perception capability. With fewer than 0.1% of parameters trainable, our approach surpasses 31 state-of-the-art methods across multiple binary segmentation benchmarks, with significant improvements in accuracy, generalization, and training efficiency. The code is publicly available.
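The MoE routing described above can be sketched in a minimal form: a gating network scores a bank of heterogeneous local-feature experts and mixes the top-k outputs. The 1-D filter "experts", the single-linear-layer gate `w_gate`, and all function names below are illustrative stand-ins, not the paper's actual convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-ins for the paper's heterogeneous convolutions:
# each "expert" extracts a different kind of local prior from the signal.
def expert_identity(x):            # pass-through prior
    return x

def expert_mean3(x):               # local smoothing prior (3-tap average)
    pad = np.pad(x, 1, mode="edge")
    return (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0

def expert_diff(x):                # local edge prior (first difference)
    return np.diff(x, prepend=x[0])

EXPERTS = [expert_identity, expert_mean3, expert_diff]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_local_priors(x, w_gate, top_k=2):
    """Route the input through the top-k experts chosen by a gating
    network (here a single linear layer w_gate), then mix their outputs
    with renormalised gate weights -- a minimal MoE sketch."""
    scores = softmax(w_gate @ x)               # one score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the chosen experts
    weights = scores[top] / scores[top].sum()  # renormalise over the top-k
    outputs = [EXPERTS[i](x) for i in top]
    return sum(w * o for w, o in zip(weights, outputs))

x = rng.normal(size=8)
w_gate = rng.normal(size=(len(EXPERTS), 8))
prior = gated_local_priors(x, w_gate, top_k=2)   # mixed local prior, shape (8,)
```

Because only the selected experts run, the gate can specialise different priors to different inputs without paying for the whole bank on every forward pass.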
Abstract
Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when such models are adapted to a specific task through full-parameter fine-tuning, updating the enormous number of parameters incurs significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic-priors-based fine-tuning paradigm with few trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions, while a gating network dynamically outputs the expert priors required for subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact with and restructure frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach (code: https://github.com/CSYSI/Controllable-LPMoE), demonstrating excellent segmentation performance compared with 31 state-of-the-art (SOTA) methods and adaptability across multiple binary object segmentation tasks.
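The bidirectional interaction between frozen and trainable features can be illustrated with a stripped-down sketch: attention logits are cosine similarities between L2-normalised queries and keys, and the exchange runs in both directions with residual updates. This is an assumption-laden simplification; the deformable sampling and the channel-oriented scale enhancement of the actual adapter are omitted, and the temperature `tau` and all shapes are illustrative.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    # Normalise feature vectors so similarity depends on direction only.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_aligned_attention(queries, keys, values, tau=0.1):
    """Attention whose logits are cosine similarities between L2-normalised
    queries and keys (scaled by temperature tau), so frozen and trainable
    tokens are compared by direction rather than magnitude."""
    sim = l2norm(queries) @ l2norm(keys).T   # (Nq, Nk) logits in [-1, 1]
    attn = softmax(sim / tau, axis=-1)       # rows sum to 1
    return attn @ values

def bidirectional_interact(frozen, prior, tau=0.1):
    """One round of the bidirectional exchange: prior tokens attend to
    frozen tokens and vice versa, each with a residual update."""
    prior_out = prior + cosine_aligned_attention(prior, frozen, frozen, tau)
    frozen_out = frozen + cosine_aligned_attention(frozen, prior, prior, tau)
    return frozen_out, prior_out

rng = np.random.default_rng(1)
frozen = rng.normal(size=(16, 32))   # tokens from the frozen backbone
prior = rng.normal(size=(16, 32))    # local-prior tokens from the extractor
f2, p2 = bidirectional_interact(frozen, prior)
```

Normalising before the dot product keeps the logits bounded, which is one common motivation for cosine-style attention when mixing features from a frozen backbone whose activation scale the adapter cannot change.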