Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification

📅 2025-04-17
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the directional convergence and implicit bias of gradient descent (GD) in continual learning for multi-task linear classification, asking how task ordering (cyclic versus random) affects convergence to the joint maximum-margin solution. Method: theoretical analysis of sequential GD dynamics under continual learning, introducing a novel non-asymptotic cycle-averaged analysis to quantify forgetting and extending the convergence analysis to jointly inseparable settings. Contributions/Results: (i) a first theoretical proof that continual training converges in direction to the joint maximum-margin solution, even though GD on each individual task is implicitly biased toward that task's own max-margin solution, whose direction can differ substantially from the joint one; (ii) a quantitative characterization of how task alignment relates to catastrophic forgetting and backward knowledge transfer; (iii) a non-asymptotic cycle-averaged analysis showing that forgetting vanishes as cycles repeat; (iv) a first rigorous extension of the convergence guarantees to jointly inseparable cases, proving convergence to the unique global minimizer of the joint loss.

📝 Abstract
We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
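The training procedure the abstract describes (sequential GD with a fixed per-task iteration budget, cycled over jointly separable tasks) can be illustrated with a minimal sketch; the toy tasks, learning rate, and budgets below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def logistic_grad(w, X, y):
    # Gradient of the mean logistic loss (1/n) * sum_i log(1 + exp(-y_i <x_i, w>)).
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return (X.T @ coef) / len(y)

def continual_gd(tasks, cycles=200, steps_per_task=20, lr=0.5):
    # Sequentially run GD for a fixed budget of steps per task, in cyclic order.
    w = np.zeros(tasks[0][0].shape[1])
    for _ in range(cycles):
        for X, y in tasks:
            for _ in range(steps_per_task):
                w -= lr * logistic_grad(w, X, y)
    return w

# Two toy binary tasks that are jointly linearly separable (illustrative data).
task_a = (np.array([[2.0, 0.5], [1.5, 1.0], [-2.0, -0.5], [-1.5, -1.0]]),
          np.array([1.0, 1.0, -1.0, -1.0]))
task_b = (np.array([[0.5, 2.0], [1.0, 1.5], [-0.5, -2.0], [-1.0, -1.5]]),
          np.array([1.0, 1.0, -1.0, -1.0]))

w = continual_gd([task_a, task_b])
direction = w / np.linalg.norm(w)
# The learned direction separates *all* points of both tasks, even though GD on
# either task alone is biased toward that task's individual max-margin direction.
print(direction)
```

On this toy data the final direction correctly classifies every point from both tasks, consistent with the directional convergence to the joint solution that the paper proves.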
Problem

Research questions and friction points this paper is trying to address.

Analyzes gradient descent convergence in continual linear classification tasks
Explores implicit bias towards max-margin solutions in cyclic/random task orders
Quantifies forgetting and backward knowledge transfer, and studies convergence when tasks are not jointly separable
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential gradient descent for continual learning
Directional convergence to joint max-margin solution
Non-asymptotic analysis on cycle-averaged forgetting
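The cycle-averaged forgetting idea can also be probed empirically: per cycle, record each task's loss right after its own training block, then measure how much that loss has drifted by the end of the cycle, averaged over tasks. A sketch under illustrative assumptions (toy data and hyperparameters chosen for demonstration, not the paper's exact quantity):

```python
import numpy as np

def logistic_loss(w, X, y):
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

def logistic_grad(w, X, y):
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return (X.T @ coef) / len(y)

def cycle_averaged_forgetting(tasks, cycles=50, steps_per_task=20, lr=0.5):
    # For each cycle: after a task's GD block ends, snapshot its loss; at the
    # end of the cycle, measure how far that loss has drifted. Negative drift
    # indicates backward knowledge transfer rather than forgetting.
    w = np.zeros(tasks[0][0].shape[1])
    per_cycle = []
    for _ in range(cycles):
        end_of_block = []
        for X, y in tasks:
            for _ in range(steps_per_task):
                w -= lr * logistic_grad(w, X, y)
            end_of_block.append(logistic_loss(w, X, y))
        drift = [logistic_loss(w, X, y) - l0
                 for (X, y), l0 in zip(tasks, end_of_block)]
        per_cycle.append(float(np.mean(drift)))
    return per_cycle

# Toy jointly separable tasks with partly misaligned directions (illustrative).
task_a = (np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]),
          np.array([1.0, 1.0, -1.0, -1.0]))
task_b = (np.array([[0.0, 2.0], [0.5, 1.5], [0.0, -2.0], [-0.5, -1.5]]),
          np.array([1.0, 1.0, -1.0, -1.0]))

forgetting = cycle_averaged_forgetting([task_a, task_b])
# The per-cycle drift shrinks toward zero as cycles repeat.
print(forgetting[0], forgetting[-1])
```

In this run the drift magnitude decays across cycles, matching the paper's claim that forgetting vanishes as the cycle repeats.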
🔎 Similar Papers
2023-11-24 · International Conference on Machine Learning · Citations: 1