🤖 AI Summary
This work addresses the theoretical challenge of characterizing the implicit bias of gradient descent (GD) in non-homogeneous deep networks, focusing on the asymptotic behavior under exponential loss when the initial empirical risk is sufficiently small. Prior implicit bias analyses were restricted to homogeneous networks; this paper extends the theory for the first time to a broad class of non-homogeneous architectures—such as those with residual connections or non-homogeneous activations—that satisfy a mild approximate homogeneity condition, thereby resolving an open problem posed by Ji & Telgarsky (2020). The analysis establishes approximate monotonicity of the normalized margin, proves convergence of the parameter direction, and verifies that the limiting direction satisfies the Karush–Kuhn–Tucker (KKT) conditions of the max-margin optimization problem. Consequently, GD iterates converge in direction to a KKT solution of this problem, rigorously revealing the asymptotic implicit bias of GD in non-homogeneous deep networks.
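As a concrete anchor for the objects mentioned above, the following is the max-margin problem and its KKT conditions as they are commonly written in the homogeneous-network line of work; the notation here (parameters $\theta$, network output $f(\theta; x)$, labels $y_i \in \{\pm 1\}$, multipliers $\lambda_i$) is assumed for illustration, and the paper's precise formulation for non-homogeneous networks may differ in its normalization and constraints.

```latex
% Standard max-margin problem (notation assumed for illustration, not taken from the paper):
\begin{align*}
  \min_{\theta}\ \tfrac{1}{2}\|\theta\|_2^2
  \quad \text{s.t.} \quad y_i f(\theta; x_i) \ge 1, \qquad i = 1, \dots, n.
\end{align*}
% KKT conditions: there exist multipliers \lambda_i \ge 0 such that
\begin{align*}
  \theta = \sum_{i=1}^{n} \lambda_i\, y_i \nabla_{\theta} f(\theta; x_i),
  \qquad \lambda_i \bigl( y_i f(\theta; x_i) - 1 \bigr) = 0 \ \ \text{for all } i.
\end{align*}
```

The result described above says that the GD iterates, after normalization, converge in direction to a point satisfying conditions of this kind, i.e. the stationarity and complementary-slackness equations above.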
📝 Abstract
We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
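To make the three properties in the abstract concrete, here is a minimal, self-contained toy sketch (not the paper's construction or proof): a small network made non-homogeneous by a tanh activation and a residual-style skip term is trained by full-batch GD under exponential loss, while the empirical risk, the parameter norm, a normalized margin, and the drift of the parameter direction are tracked. The architecture, step size, and the degree-2 normalization used for the margin are illustrative assumptions, not choices taken from the paper; the expected qualitative behavior is that the risk decreases, the norm grows, the normalized margin stabilizes, and the direction drift shrinks.

```python
# Illustrative toy sketch only; architecture, step size, and the degree-2
# margin normalization are assumptions for this example, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data with labels in {-1, +1}.
n, d = 40, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Small non-homogeneous network: tanh hidden layer plus a residual (skip) term.
h = 8
params = {
    "W1": 0.5 * rng.normal(size=(h, d)),
    "w2": 0.5 * rng.normal(size=h),
    "v": 0.5 * rng.normal(size=d),   # skip connection -> breaks homogeneity
}

def net(p, X):
    return np.tanh(X @ p["W1"].T) @ p["w2"] + X @ p["v"]

def grads(p, X, y):
    # Exponential loss L = mean(exp(-y * f)); gradients derived by hand for this tiny model.
    pre = X @ p["W1"].T                 # (n, h)
    hid = np.tanh(pre)
    f = hid @ p["w2"] + X @ p["v"]
    e = np.exp(-y * f) / len(y)         # per-example loss terms
    df = -e * y                         # dL/df per example
    gw2 = hid.T @ df
    gv = X.T @ df
    gW1 = ((df[:, None] * p["w2"]) * (1 - hid ** 2)).T @ X
    return {"W1": gW1, "w2": gw2, "v": gv}, np.sum(e)

def flat(p):
    return np.concatenate([p[k].ravel() for k in sorted(p)])

lr, L_deg = 0.1, 2.0                    # L_deg: nominal degree used to normalize the margin
prev_dir = None
for t in range(20001):
    g, risk = grads(params, X, y)
    for k in params:
        params[k] -= lr * g[k]
    if t % 5000 == 0:
        theta = flat(params)
        margin = np.min(y * net(params, X)) / np.linalg.norm(theta) ** L_deg
        direction = theta / np.linalg.norm(theta)
        drift = 0.0 if prev_dir is None else 1 - direction @ prev_dir
        prev_dir = direction
        print(f"t={t:6d}  risk={risk:.3e}  ||theta||={np.linalg.norm(theta):7.2f}  "
              f"norm. margin={margin:.4f}  direction drift={drift:.2e}")
```

Running this prints the risk shrinking toward zero, the parameter norm growing, the normalized margin settling (and negative early on if some points are still misclassified, mirroring the paper's requirement that the analysis start from a sufficiently small empirical risk), and the direction drift decaying, which is the toy analogue of directional convergence.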