🤖 AI Summary
To address latent forgetting and prompt memory explosion, two critical bottlenecks in prompt-based continual learning (CL) under task-agnostic inference, this paper proposes GRID, a novel framework for efficient lifelong adaptation of large language models. Methodologically, GRID introduces (1) a task-aware decoding mechanism that combines representative inputs, automatic task identification, and constrained decoding to align generation with task semantics at inference time and mitigate forgetting, and (2) a gradient-based prompt selection and aggregation strategy that uses gradient similarity to retain informative prompts and compress the rest into a single aggregated representation. Empirically, GRID substantially improves backward transfer across multiple benchmarks, reducing forgotten tasks by up to 80%, while preserving competitive forward transfer. Crucially, prompt memory scales sublinearly with the number of tasks, yielding a scalable, memory-efficient paradigm for lifelong adaptation of LLMs.
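The constrained-decoding step of the task identification mechanism can be illustrated with a minimal sketch. The paper does not specify the implementation; the sketch below assumes the common approach of masking every vocabulary position outside an allowed label set to negative infinity before taking the argmax, so the decoder can only emit a valid task label.

```python
import numpy as np

def constrained_decode(logits, allowed_token_ids):
    """Greedy decoding restricted to an allowed token set.

    Positions outside `allowed_token_ids` are masked to -inf,
    so the argmax is guaranteed to name a known task.
    """
    mask = np.full_like(logits, -np.inf)
    mask[allowed_token_ids] = 0.0
    return int(np.argmax(logits + mask))

# Toy vocabulary of 10 tokens; only ids 2, 5, and 7 name known tasks.
logits = np.array([0.1, 3.0, 0.5, 2.0, 0.0, 1.5, 0.2, 0.9, 0.3, 0.4])
task_id = constrained_decode(logits, [2, 5, 7])
# Token 1 has the highest raw logit but is not a task label,
# so decoding falls back to the best allowed token (id 5).
```

The same masking trick extends to multi-token labels by reapplying the mask at each decoding step over the set of labels still consistent with the prefix.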
📝 Abstract
Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, most existing methods assume task-aware inference and maintain a growing list of task-specific prompts, which limits scalability and hides latent forgetting. In this work, we introduce GRID, a unified framework that addresses two key limitations: (1) latent forgetting under task-agnostic inference, and (2) prompt memory explosion as task sequences grow. GRID integrates a task-aware decoding mechanism that improves backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Additionally, we propose a gradient-based prompt selection strategy that compresses less informative prompts into a single aggregated representation, enabling scalable and memory-efficient lifelong learning. Extensive experiments across short-sequence, long-sequence, and negative transfer benchmarks show that GRID significantly improves backward transfer, achieves competitive forward transfer, and reduces forgotten tasks by up to 80%, outperforming state-of-the-art methods on T5 and Flan-T5 backbones.
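The gradient-based compression step described above can be sketched as follows. The scoring rule, the NumPy representation, and the names `select_and_aggregate` and `keep_k` are illustrative assumptions, not the paper's implementation: prompts are ranked by the mean pairwise cosine similarity of their task gradients, the top-k are kept as-is, and the remainder are averaged into one aggregated prompt, so memory no longer grows by one prompt per task.

```python
import numpy as np

def select_and_aggregate(prompt_grads, prompts, keep_k):
    """Keep the keep_k prompts judged most informative by gradient
    similarity; mean-pool the rest into a single aggregated prompt.

    prompt_grads, prompts: arrays of shape (num_tasks, prompt_dim).
    Note: using mean similarity as the informativeness proxy is an
    assumption made for this sketch.
    """
    g = prompt_grads / np.linalg.norm(prompt_grads, axis=1, keepdims=True)
    sim = g @ g.T                    # pairwise cosine similarity
    score = sim.mean(axis=1)         # per-prompt informativeness proxy
    order = np.argsort(-score)       # most informative first
    keep, merge = order[:keep_k], order[keep_k:]
    aggregated = prompts[merge].mean(axis=0)
    return prompts[keep], aggregated

# Five task prompts of dimension 8, compressed down to 2 + 1 aggregate.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(5, 8))
grads = rng.normal(size=(5, 8))
kept, aggregated = select_and_aggregate(grads, prompts, keep_k=2)
```

After compression, storage is `keep_k + 1` prompt vectors regardless of how many tasks have been seen, which is the memory behavior the abstract attributes to GRID.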