🤖 AI Summary
This work addresses the theoretical challenge of characterizing the implicit bias of gradient descent (GD) in non-homogeneous deep networks, focusing on the asymptotic behavior under exponential loss when the initial empirical risk is sufficiently small. Prior implicit bias analyses were restricted to homogeneous networks; this paper extends the theory for the first time to a broad class of non-homogeneous architectures—such as those with residual connections or non-homogeneous activations—that satisfy a mild approximate homogeneity condition, thereby resolving an open problem posed by Ji & Telgarsky (2020). The analysis establishes approximate monotonicity of the normalized margin, proves convergence of the parameter direction, and verifies that the limiting direction satisfies the Karush–Kuhn–Tucker (KKT) conditions of the max-margin optimization problem. Consequently, GD iterates converge in direction to a KKT solution of this problem, rigorously revealing the asymptotic implicit bias of GD in non-homogeneous deep networks.
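As a concrete anchor for the objects mentioned above, the following is the max-margin problem and its KKT conditions as they are commonly written in the homogeneous-network line of work; the notation here (parameters $\theta$, network output $f(\theta; x)$, labels $y_i \in \{\pm 1\}$, multipliers $\lambda_i$) is assumed for illustration, and the paper's precise formulation for non-homogeneous networks may differ in its normalization and constraints.

```latex
% Standard max-margin problem (notation assumed for illustration, not taken from the paper):
\begin{align*}
  \min_{\theta}\ \tfrac{1}{2}\|\theta\|_2^2
  \quad \text{s.t.} \quad y_i f(\theta; x_i) \ge 1, \qquad i = 1, \dots, n.
\end{align*}
% KKT conditions: there exist multipliers \lambda_i \ge 0 such that
\begin{align*}
  \theta = \sum_{i=1}^{n} \lambda_i\, y_i \nabla_{\theta} f(\theta; x_i),
  \qquad \lambda_i \bigl( y_i f(\theta; x_i) - 1 \bigr) = 0 \ \ \text{for all } i.
\end{align*}
```

The result described above says that the GD iterates, after normalization, converge in direction to a point satisfying conditions of this kind, i.e. the stationarity and complementary-slackness equations above.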
📝 Abstract
We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
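To make the three properties in the abstract concrete, here is a minimal, self-contained toy sketch (not the paper's construction or proof): a small network made non-homogeneous by a tanh activation and a residual-style skip term is trained by full-batch GD under exponential loss, while the empirical risk, the parameter norm, a normalized margin, and the drift of the parameter direction are tracked. The architecture, step size, and the degree-2 normalization used for the margin are illustrative assumptions, not choices taken from the paper; the expected qualitative behavior is that the risk decreases, the norm grows, the normalized margin stabilizes, and the direction drift shrinks.

```python
# Illustrative toy sketch only; architecture, step size, and the degree-2
# margin normalization are assumptions for this example, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data with labels in {-1, +1}.
n, d = 40, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Small non-homogeneous network: tanh hidden layer plus a residual (skip) term.
h = 8
params = {
    "W1": 0.5 * rng.normal(size=(h, d)),
    "w2": 0.5 * rng.normal(size=h),
    "v": 0.5 * rng.normal(size=d),   # skip connection -> breaks homogeneity
}

def net(p, X):
    return np.tanh(X @ p["W1"].T) @ p["w2"] + X @ p["v"]

def grads(p, X, y):
    # Exponential loss L = mean(exp(-y * f)); gradients derived by hand for this tiny model.
    pre = X @ p["W1"].T                 # (n, h)
    hid = np.tanh(pre)
    f = hid @ p["w2"] + X @ p["v"]
    e = np.exp(-y * f) / len(y)         # per-example loss terms
    df = -e * y                         # dL/df per example
    gw2 = hid.T @ df
    gv = X.T @ df
    gW1 = ((df[:, None] * p["w2"]) * (1 - hid ** 2)).T @ X
    return {"W1": gW1, "w2": gw2, "v": gv}, np.sum(e)

def flat(p):
    return np.concatenate([p[k].ravel() for k in sorted(p)])

lr, L_deg = 0.1, 2.0                    # L_deg: nominal degree used to normalize the margin
prev_dir = None
for t in range(20001):
    g, risk = grads(params, X, y)
    for k in params:
        params[k] -= lr * g[k]
    if t % 5000 == 0:
        theta = flat(params)
        margin = np.min(y * net(params, X)) / np.linalg.norm(theta) ** L_deg
        direction = theta / np.linalg.norm(theta)
        drift = 0.0 if prev_dir is None else 1 - direction @ prev_dir
        prev_dir = direction
        print(f"t={t:6d}  risk={risk:.3e}  ||theta||={np.linalg.norm(theta):7.2f}  "
              f"norm. margin={margin:.4f}  direction drift={drift:.2e}")
```

Running this prints the risk shrinking toward zero, the parameter norm growing, the normalized margin settling (and negative early on if some points are still misclassified, mirroring the paper's requirement that the analysis start from a sufficiently small empirical risk), and the direction drift decaying, which is the toy analogue of directional convergence.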