ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

📅 2024-06-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of label scarcity and parameter inefficiency in cross-domain transfer of Vision Transformers (ViTs), this paper proposes an unsupervised, parameter-efficient extended pre-training paradigm. The method combines Low-Rank Adaptation (LoRA) with full unfreezing of only 1–2 Transformer blocks, while retaining established self-supervised objectives, either self-distillation (DINOv2) or masked image modeling (MAE), for unsupervised in-domain pre-training on target domains such as satellite imagery. Crucially, no downstream supervision is required during this extended pre-training stage. Experiments demonstrate substantial gains in linear-probe performance: top-1 accuracy improves by up to 8% over prior fully fine-tuned state-of-the-art approaches while using fewer than 10% of their trainable parameters. This work establishes a lightweight, adaptive paradigm for deploying large vision models across diverse domains without labeled data.

📝 Abstract
Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and unfreezing more ViT blocks. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/
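The abstract's "<10% of the parameters" claim can be sanity-checked with a rough back-of-the-envelope count of the freezing scheme: unfreeze the last 1-2 ViT blocks fully and attach low-rank adapters to the attention projections of every frozen block. This is a hypothetical sketch with ViT-L-like dimensions; the per-block count ignores biases, layer norms, and patch/head parameters, and the function name and defaults are illustrative, not from the paper's code.

```python
def trainable_parameter_counts(n_blocks=24, d=1024, mlp_ratio=4,
                               unfrozen_blocks=2, rank=8):
    """Approximate trainable vs. total parameters under an
    ExPLoRA-style scheme (hypothetical sketch, not the paper's code)."""
    # Rough per-block count for a standard ViT block:
    # attention (Q, K, V, output projections) + two-layer MLP.
    attn = 4 * d * d
    mlp = 2 * d * (mlp_ratio * d)
    per_block = attn + mlp

    total = n_blocks * per_block
    # LoRA adds rank * (d_in + d_out) parameters per adapted matrix;
    # here we adapt the 4 attention projections of each frozen block.
    lora_per_block = 4 * rank * (d + d)
    trainable = (unfrozen_blocks * per_block
                 + (n_blocks - unfrozen_blocks) * lora_per_block)
    return trainable, total
```

With these defaults the trainable fraction comes out under 10%, dominated by the two fully unfrozen blocks, which is consistent with the abstract's parameter-budget claim.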
Problem

Research questions and friction points this paper is trying to address.

Adapt pre-trained vision transformers
Efficient self-supervised pre-training
Improve transfer learning under domain shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends pre-training via self-supervision
Uses LoRA for parameter-efficient fine-tuning
Unfreezes 1-2 ViT blocks for adaptation
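The LoRA component of the bullets above can be sketched as a frozen weight plus a trainable low-rank update. This is a minimal NumPy illustration of the standard LoRA formulation (y = Wx + (alpha/r)·BAx with B zero-initialized), not the paper's actual PyTorch implementation; the class name and hyperparameter defaults are assumptions.

```python
import numpy as np

class LoRALinear:
    """Hypothetical minimal LoRA-adapted linear layer.

    The frozen pre-trained weight W is augmented with a trainable
    low-rank update B @ A, scaled by alpha / r.
    """
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                       # frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # trainable
        self.B = np.zeros((d_out, r))                    # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Zero-initializing B makes the layer start out exactly equal
        # to the frozen pre-trained layer; training moves it away.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Only A and B receive gradients during adaptation, which is what keeps the trainable-parameter count small relative to full fine-tuning.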