ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

📅 2025-10-11
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address catastrophic forgetting and limited domain capacity in continual pretraining (CPT) of large language models (LLMs), this paper proposes ADEPT, an Adaptive Expansion and Dynamic Decoupled Tuning framework. It introduces a functionality-aware selective layer expansion mechanism, combined with unit-level importance-aware decoupled optimization and asymmetric learning rate scheduling, to jointly model general-capability retention and domain-specific knowledge injection. The key innovation is functionally decoupling parameter expansion from parameter updating, removing the entanglement between general and domain learning. Experiments on mathematical and medical benchmarks show that tuning only 15% of parameters cuts training time by more than 50% while outperforming full-parameter CPT by up to 5.76% on general capability and 5.58% on domain-specific performance.
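To make the importance-aware component concrete, the sketch below estimates per-layer importance for general capability with a Fisher-style squared-gradient score on general-domain batches. This is a minimal PyTorch sketch under stated assumptions: a Hugging Face LLaMA-style model exposing `model.model.layers`, and a gradient-based proxy score. The function name and the scoring criterion are illustrative, not the paper's exact method.

```python
import torch

def general_layer_importance(model, general_batches, num_layers):
    """Fisher-style proxy for how critical each transformer layer is to
    general capability: accumulate squared parameter gradients on
    general-domain batches. (Assumed proxy; the paper's exact importance
    metric may differ.)"""
    scores = torch.zeros(num_layers)
    model.train()
    for batch in general_batches:
        model.zero_grad()
        # Standard causal-LM loss with inputs reused as labels.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for i in range(num_layers):
            for p in model.model.layers[i].parameters():
                if p.grad is not None:
                    scores[i] += p.grad.float().pow(2).sum().item()
    # Low score = least general-critical = candidate for duplication.
    return scores
```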

📝 Abstract
Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, the uniform expansion and updates still entangle general and domain learning, undermining its effectiveness. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We then propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical benchmarks show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general domain and 5.58% on the target domain with only 15% of parameters tuned and less than 50% training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://github.com/PuppyKnightUniversity/ADEPT.
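The first stage can be pictured as follows: duplicate the layers that score lowest on general-domain importance and make only the copies trainable. This is a minimal sketch under the same LLaMA-style assumptions as above; `expand_least_critical`, the insertion position, and the freezing policy are illustrative, and whether ADEPT additionally identity-initializes the copies (as some depth-expansion methods do by zeroing output projections) is not stated here.

```python
import copy
import torch.nn as nn

def expand_least_critical(model, scores, k=4):
    """Insert a trainable copy directly after each of the k least
    general-critical layers, then freeze everything except the copies.
    (Hypothetical sketch of the expansion stage, not the authors' API.)"""
    least = set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])
    layers, expanded_ids = [], []
    for i, layer in enumerate(model.model.layers):
        layers.append(layer)
        if i in least:
            expanded_ids.append(len(layers))      # index the copy will occupy
            layers.append(copy.deepcopy(layer))   # copy absorbs domain knowledge
    model.model.layers = nn.ModuleList(layers)
    model.config.num_hidden_layers = len(layers)
    for p in model.parameters():                  # retain: freeze the base model
        p.requires_grad = False
    for idx in expanded_ids:                      # inject: tune only the copies
        for p in model.model.layers[idx].parameters():
            p.requires_grad = True
    return model, expanded_ids
```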
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in continual pretraining of large language models
Solves limited domain capacity through adaptive layer expansion strategies
Separates general and domain learning via decoupled parameter optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective layer expansion for domain capacity
Unit-wise decoupled tuning for knowledge balance
Asymmetric learning rates for retention and injection (see the sketch after this list)
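Unit-wise decoupling with asymmetric learning rates maps naturally onto optimizer parameter groups: within the expanded layers, general-critical units get a small learning rate (retention) while the remaining units get a large one (injection). A hedged sketch only; `unit_importance`, the threshold `tau`, and both learning rates are placeholder assumptions, not values from the paper.

```python
import torch

def build_asymmetric_optimizer(model, expanded_ids, unit_importance,
                               lr_domain=2e-5, lr_general=2e-6, tau=0.5):
    """Split parameter units in the expanded layers by general-domain
    importance and give each group its own learning rate.
    `unit_importance` maps "<layer_idx>.<param_name>" to a score in [0, 1];
    this keying scheme and the thresholds are hypothetical."""
    retain, inject = [], []
    for idx in expanded_ids:
        for name, p in model.model.layers[idx].named_parameters():
            if unit_importance.get(f"{idx}.{name}", 0.0) >= tau:
                retain.append(p)   # general-critical: update cautiously
            else:
                inject.append(p)   # free capacity: update aggressively
    return torch.optim.AdamW([
        {"params": inject, "lr": lr_domain},
        {"params": retain, "lr": lr_general},
    ])
```

Grouping by importance rather than by layer is what makes the tuning unit-wise: two weight matrices inside the same expanded layer can receive different learning rates.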
Authors

Jinyang Zhang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Yue Fang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Hongxin Ding
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Weibin Liao
Peking University
Large Language Model · Reinforcement Learning · Medical Image Analysis
Muyang Ye
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Xu Chu
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Junfeng Zhao
Yasha Wang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China