ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses catastrophic forgetting in large language models during continual learning, a challenge exacerbated by existing approaches that either rely on data replay—compromising privacy—or enforce rigid orthogonality constraints that deplete representational capacity. To overcome these limitations, the authors propose a replay-free, architecture-agnostic training framework with constant memory overhead. The method introduces a lightweight regularizer that applies anisotropic shrinkage to an aggregated update matrix, selectively suppressing alignment with high-energy directions associated with past tasks while preserving low-energy subspaces to facilitate forward transfer. This approach balances stability and scalability, attaining state-of-the-art performance across three mainstream benchmarks with up to a 9.6% relative accuracy gain and a 35-fold reduction in memory footprint, while also improving zero-shot generalization to unseen tasks.

📝 Abstract
Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited. Replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse at scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignment along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task-sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.
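The abstract describes the core mechanism at a high level: characterize the aggregated past updates, then penalize only the component of a new adapter update that aligns with their high-energy (task-specific) directions, leaving low-energy subspaces free for forward transfer. Below is a minimal sketch of one plausible reading of that idea, using an SVD of the aggregated past-update matrix; the function name and the `k`/`lam` hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def selective_subspace_penalty(delta_w, past_update, k=8, lam=1.0):
    """Hypothetical sketch of a selective subspace de-correlation penalty.

    Penalizes the energy of a new adapter update `delta_w` that falls
    inside the top-k (high-energy) left singular subspace of the
    aggregated past-update matrix, while leaving the low-energy
    residual subspace unpenalized.
    """
    # SVD of the aggregated past update: large singular values mark
    # high-energy, task-specific directions to protect.
    U, S, Vt = np.linalg.svd(past_update, full_matrices=False)
    U_k = U[:, :k]  # top-k high-energy left singular directions
    # Project the new update onto the protected subspace and
    # penalize only that aligned component (anisotropic shrinkage).
    proj = U_k @ (U_k.T @ delta_w)
    return lam * np.sum(proj ** 2)
```

Under this reading, an update lying entirely in the low-energy residual subspace incurs zero penalty, which is what preserves forward transfer; an update aligned with a dominant past direction is shrunk in proportion to `lam`.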
Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting
continual learning
large language models
lifelong learning
task adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

lifelong learning
catastrophic forgetting
selective subspace decorrelation
adapter tuning
large language models