🤖 AI Summary
In streaming continual learning, the way a stream is partitioned into temporal tasks has a substantial but largely overlooked effect on evaluation outcomes, which can make benchmark conclusions unreliable. This work treats temporal partitioning as a first-class evaluation variable and introduces an analytical framework based on plasticity and stability profiles. It proposes two measures: a profile distance between partitionings, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any model is trained. With the stream, model, and training budget held fixed, experiments on network traffic forecasting with CESNET-Timeseries24 show substantial changes in forecasting error, forgetting, and backward transfer across 9-day, 30-day, and 44-day partitions. Shorter partitions induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating that evaluation results are sensitive to the choice of temporal partitioning.
📝 Abstract
Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.
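To make the taskification step concrete, the following is a minimal sketch of fixed-length temporal partitioning, assuming the stream is indexed by day. The function name `taskify`, the 90-day stream length, and the half-open window convention are illustrative assumptions, not details taken from the paper; the paper's profile distance and BPS are not reproduced here because the abstract does not define them.

```python
from typing import List, Tuple


def taskify(num_days: int, task_len_days: int) -> List[Tuple[int, int]]:
    """Split day indices [0, num_days) into consecutive half-open windows.

    Each window (start, end) is one task. The last window may be shorter
    when num_days is not a multiple of task_len_days.
    """
    if num_days <= 0 or task_len_days <= 0:
        raise ValueError("num_days and task_len_days must be positive")
    return [
        (start, min(start + task_len_days, num_days))
        for start in range(0, num_days, task_len_days)
    ]


# Hypothetical stream length, chosen only for illustration.
STREAM_DAYS = 90

# The same stream under the three task lengths named in the abstract.
splits = {length: taskify(STREAM_DAYS, length) for length in (9, 30, 44)}

for length, tasks in splits.items():
    boundaries = [start for start, _ in tasks[1:]]
    print(f"{length}-day split: {len(tasks)} tasks, boundaries at days {boundaries}")
```

Even in this toy setting the three splits place task boundaries at different days and produce different task counts, so a learner trained sequentially sees different task sequences from identical data. How a short trailing task is handled (kept, merged, or dropped) is itself a taskification choice that this sketch leaves open.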