🤖 AI Summary
This study addresses an oversight in existing continual learning evaluations: comparisons usually fix the fine-tuning regime and ignore how the trainable parameter subspace affects the results. The work formalizes fine-tuning as projected optimization within a fixed trainable subspace and systematically investigates how the depth of trainable parameters influences both task fitting and knowledge retention. Experiments in the task-incremental setting cover five benchmark datasets (MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100) with eleven task orders per dataset, combining four standard methods (online EWC, LwF, SI, GEM) with five trainable-depth regimes. The relative ranking of methods is not consistently preserved across regimes. Deeper adaptation regimes are associated with larger parameter updates, higher forgetting, and a stronger relationship between the two. The findings indicate that trainable depth should be treated as an explicit experimental variable in continual learning evaluation protocols.
📝 Abstract
Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current-task fitting and knowledge preservation operate. This analysis motivates the hypothesis that method comparisons need not be invariant across regimes. We test this hypothesis in task-incremental CL, using five trainable-depth regimes and four standard methods: online EWC, LwF, SI, and GEM. Across five benchmark datasets (MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100) and 11 task orders per dataset, we find that the relative ranking of methods is not consistently preserved across regimes. We further show that deeper adaptation regimes are associated with larger update magnitudes, higher forgetting, and a stronger relationship between the two. These results show that comparative conclusions in CL can depend strongly on the chosen fine-tuning regime, motivating regime-aware evaluation protocols that treat trainable depth as an explicit experimental factor.
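To make the idea of projected optimization over a fixed trainable subspace concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that a depth regime unfreezes the last `depth` layers of the network, and that projecting onto the trainable subspace amounts to zeroing the gradient of every frozen layer. The layer names, the learning rate, and the helpers `trainable_layers`, `projected_sgd_step`, and `update_norm` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

# Hypothetical layer names, ordered from input to output (an assumption).
LAYERS = ["conv1", "conv2", "fc1", "head"]


def trainable_layers(depth: int) -> List[str]:
    """Return the last `depth` layers; all earlier layers stay frozen."""
    if not 0 <= depth <= len(LAYERS):
        raise ValueError(f"depth must be between 0 and {len(LAYERS)}")
    return LAYERS[len(LAYERS) - depth:]


def projected_sgd_step(params: Params, grads: Params, depth: int,
                       lr: float = 0.1) -> Params:
    """One SGD step with the gradient projected onto the trainable subspace.

    For a coordinate-aligned subspace, the projection keeps the gradient of
    trainable layers and discards the gradient of frozen layers.
    """
    active = set(trainable_layers(depth))
    new_params: Params = {}
    for name, values in params.items():
        if name in active:
            new_params[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            new_params[name] = list(values)  # frozen: copied unchanged
    return new_params


def update_norm(old: Params, new: Params) -> float:
    """Euclidean norm of the parameter change across all layers."""
    squared = sum((n - o) ** 2
                  for name in old
                  for o, n in zip(old[name], new[name]))
    return squared ** 0.5


if __name__ == "__main__":
    params = {name: [1.0, 1.0] for name in LAYERS}
    grads = {name: [1.0, 1.0] for name in LAYERS}
    for depth in range(len(LAYERS) + 1):
        stepped = projected_sgd_step(params, grads, depth)
        print(depth, trainable_layers(depth), update_norm(params, stepped))
```

With identical gradients in every layer, the update norm in this toy setup grows as more layers are unfrozen, simply because more coordinates move. That shows only the mechanical part of the effect. The relationship between update magnitude and forgetting reported in the abstract is an empirical finding that this sketch does not reproduce.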