AI Summary
Vision Transformers (ViTs) such as SAM often struggle to capture high-level semantic features in downstream domains (e.g., medical imaging, agriculture) because their encoders lack intra-patch spatial priors. Method: We propose NAS-LoRA, a novel parameter-efficient fine-tuning (PEFT) framework that, for the first time, integrates neural architecture search (NAS) into PEFT to dynamically discover and inject spatial priors into adapter layers; it further employs a staged optimization strategy to strengthen LoRA's capacity for learning semantic representations. Contribution/Results: NAS-LoRA introduces no inference overhead while significantly outperforming existing PEFT methods across multiple downstream tasks. It reduces training cost by 24.14% and empirically validates the effectiveness and generalizability of NAS-driven spatial prior modeling for adapting ViTs to domain-specific semantics.
Abstract
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance across diverse domains. Despite these advances, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant because the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA inserts a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy that helps the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Extensive experiments demonstrate that NAS-LoRA outperforms existing PEFT methods while reducing training cost by 24.14% and adding no inference cost, highlighting the potential of NAS for enhancing PEFT in visual foundation models.
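To make the described structure concrete, the following is a minimal numpy sketch of what an adapter with a searchable block between LoRA's down-projection (encoder) and up-projection (decoder) could look like. All names (`NASLoRALayer`, the candidate operations, the DARTS-style softmax mixture over architecture logits) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class NASLoRALayer:
    """Hypothetical sketch of a NAS-LoRA-style adapter (names assumed).

    A standard LoRA update computes W x + B A x. Here a small searchable
    block g(.) sits between the down-projection A and the up-projection B:
        W x + B g(A x),
    where g is a softmax-weighted mixture of candidate operations whose
    mixing logits are learned during search.
    """
    def __init__(self, d, r, candidates):
        self.A = rng.standard_normal((r, d)) * 0.01  # down-projection (LoRA encoder)
        self.B = np.zeros((d, r))                    # up-projection, zero-init as in LoRA
        self.candidates = candidates                 # list of ops mapping (r,) -> (r,)
        self.alpha = np.zeros(len(candidates))       # architecture logits (searched)

    def forward(self, W, x):
        h = self.A @ x                               # project into low-rank bottleneck
        w = softmax(self.alpha)                      # mixture weights over candidate ops
        h = sum(wi * op(h) for wi, op in zip(w, self.candidates))
        return W @ x + self.B @ h                    # frozen weight path + adapted update
```

Because `B` is zero-initialized, the layer initially reproduces the frozen model's output exactly; the search over `alpha` and the updates to `A`/`B` then shape what prior is injected into the low-rank path.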