SCALE: Upscaled Continual Learning of Large Language Models

📅 2025-11-05
🤖 AI Summary
To address catastrophic forgetting in continual pretraining of large language models (LLMs), this paper proposes a structure-aware width expansion paradigm: freezing the original model parameters while selectively widening only the linear layers of feed-forward networks (FFNs), augmented with lightweight, plug-and-play expansion modules. Methodologically, the approach adheres to two principles—Persistent Preservation (retaining original functionality) and Collaborative Adaptation (enabling joint assimilation of new knowledge)—implemented via fidelity-preserving initialization, selective parameter training, and SCALE-Route, a token-level dynamic routing mechanism. Crucially, it leaves residual connections and attention mechanisms unaltered. Experiments on synthetic biographical data and Korean corpora demonstrate substantial mitigation of forgetting during continual learning. The method achieves an optimal stability–plasticity trade-off, preserving English-language capabilities while significantly enhancing Korean performance.
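The core mechanism described above can be sketched in a few lines: freeze the pre-trained FFN weights, append extra hidden units ("width expansion") to the linear layers, and zero-initialize the expansion's down-projection so the widened layer reproduces the base output exactly at initialization. This is a minimal NumPy sketch under stated assumptions; the class and attribute names (`WidthExpandedLinear`, `E_up`, `E_down`) are illustrative, not the authors' API, and a ReLU FFN stands in for whatever activation the base model uses.

```python
import numpy as np

class WidthExpandedLinear:
    """Illustrative sketch of SCALE-style width expansion of an FFN block.

    The pre-trained weights (W_up, W_down) are treated as frozen; only the
    expansion weights (E_up, E_down) would be trained. Zero-initializing
    E_down means the expansion path contributes nothing at init, so the
    base model's output is preserved (a form of fidelity-preserving
    initialization, as the paper calls it).
    """

    def __init__(self, d_in, d_hidden, d_expand, rng):
        # Stand-ins for frozen pre-trained FFN weights (random here).
        self.W_up = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
        self.W_down = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_hidden)
        # Trainable expansion: d_expand extra hidden units added in width.
        self.E_up = rng.standard_normal((d_expand, d_in)) / np.sqrt(d_in)
        self.E_down = np.zeros((d_in, d_expand))  # zero-init: no output change

    def __call__(self, x):
        h_base = np.maximum(self.W_up @ x, 0.0)  # frozen path (ReLU FFN)
        h_exp = np.maximum(self.E_up @ x, 0.0)   # expansion path
        return self.W_down @ h_base + self.E_down @ h_exp
```

Because only the widths of linear modules change, the residual and attention topologies are untouched, matching the summary's claim that the base model's functionality is preserved exactly before any continual training begins.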

📝 Abstract
We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
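The abstract's SCALE-Route performs token-level routing between preservation and adaptation heads. The sketch below shows one plausible form of such a gate, a per-token softmax over two logits that mixes the two heads' outputs; the gate parameterization and function names (`route_tokens`, `w_gate`) are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(x, w_gate, head_preserve, head_adapt):
    """Token-level soft routing between two heads, in the spirit of
    SCALE-Route. Each token gets its own mixing weights, so tokens that
    need new (e.g. Korean) knowledge can lean on the adaptation head
    while others fall back to the preservation head.

    x: (n_tokens, d) hidden states; w_gate: (d, 2) gate parameters;
    head_preserve, head_adapt: (d, d_out) linear heads (stand-ins).
    """
    gate = softmax(x @ w_gate)            # (n_tokens, 2) mixing weights
    y_preserve = x @ head_preserve        # frozen/preservation-path output
    y_adapt = x @ head_adapt              # expansion/adaptation-path output
    return gate[:, :1] * y_preserve + gate[:, 1:] * y_adapt
```

With the gate weights at zero, every token mixes the two heads equally; training the gate then shifts routing per token, which is one way the preservation/adaptation interplay described in the abstract can stabilize optimization.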
Problem

Research questions and friction points this paper is trying to address.

Mitigating catastrophic forgetting in continual pre-training of large language models
Increasing model capacity without disrupting original pre-trained functionality
Achieving better stability-plasticity trade-off during knowledge acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Width upscaling architecture with lightweight expansion modules
Preserves pre-trained parameters while increasing model capacity
Selectively trains expansion components for new knowledge acquisition
Jin-woo Lee (Gen.AI Core Lab, Samsung SDS)
Junhwa Choi (Gen.AI Core Lab, Samsung SDS)
Bongkyu Hwang (Gen.AI Core Lab, Samsung SDS)
Jinho Choo (Gen.AI Core Lab, Samsung SDS)
Bogun Kim (Gen.AI Core Lab, Samsung SDS)
JeongSeon Yi (Gen.AI Core Lab, Samsung SDS)
Joonseok Lee (Google Research, Seoul National University) — Machine Learning, Computer Vision, Video Understanding, Recommendation Systems, Collaborative Filtering
DongYoung Jung (Gen.AI Core Lab, Samsung SDS)
Jaeseon Park (Gen.AI Core Lab, Samsung SDS)
Kyoungwon Park (Gen.AI Core Lab, Samsung SDS)
Suk-hoon Jung (Gen.AI Core Lab, Samsung SDS)