ReLU soothes the NTK condition number and accelerates optimization for wide neural networks

📅 2023-05-15
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
This work investigates how nonlinear activation functions—particularly ReLU—affect the condition number of the Neural Tangent Kernel (NTK) in wide neural networks. Method: Integrating NTK theory, random initialization analysis, and geometric characterization of the gradient feature space, we analytically characterize how ReLU reshapes the geometry of gradients. Contribution/Results: We establish, for the first time, a theoretical link between ReLU activation and improved NTK conditioning: ReLU significantly enhances angular separation among gradients of similar samples, thereby reducing the NTK condition number. Moreover, increasing network depth strictly lowers the condition number—contrary to the classical result that linear networks exhibit depth-invariant, constant NTK condition numbers. We rigorously prove that deep, wide ReLU networks achieve superior data separability and smaller NTK condition numbers compared to their linear or shallow counterparts, providing a novel geometric explanation and theoretical foundation for accelerated gradient descent convergence.
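The paper's central claim can be sanity-checked numerically. The sketch below compares the condition number of the infinite-width NTK of a one-hidden-layer ReLU network against its linear (identity-activation) counterpart on a few nearly parallel unit vectors, i.e. the "similar data" regime the paper studies. It uses the standard closed-form arc-cosine NTK expressions for the infinite-width limit, not code from the paper; the example data points are illustrative choices.

```python
import numpy as np

def relu_ntk(c):
    """Infinite-width NTK of a one-hidden-layer ReLU net on unit inputs
    with cosine similarity c (standard arc-cosine kernel formulas)."""
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    # E[relu(w.x) relu(w.x')] + c * E[1{w.x>0} 1{w.x'>0}], w ~ N(0, I),
    # up to an overall scale that does not affect the condition number
    return (np.sin(theta) + 2.0 * (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

def linear_ntk(c):
    """Same architecture with identity activation: NTK proportional
    to the input Gram matrix, exactly as for a linear model."""
    return c

# three nearly parallel unit vectors ("similar data"; illustrative choice)
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.1, 0.0],
              [1.0, 0.0, 0.1]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
C = X @ X.T  # pairwise cosine similarities

K_relu, K_lin = relu_ntk(C), linear_ntk(C)
print("linear NTK cond:", np.linalg.cond(K_lin))
print("ReLU   NTK cond:", np.linalg.cond(K_relu))
```

ReLU pushes the off-diagonal kernel entries further below the diagonal than the linear kernel does for the same small input angles, so its NTK condition number comes out markedly smaller, consistent with the separation argument summarized above.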
📝 Abstract
Rectified linear unit (ReLU), as a non-linear activation function, is well known to improve the expressivity of neural networks such that any continuous function can be approximated to arbitrary precision by a sufficiently wide neural network. In this work, we present another interesting and important feature of ReLU activation function. We show that ReLU leads to: *better separation* for similar data, and *better conditioning* of neural tangent kernel (NTK), which are closely related. Comparing with linear neural networks, we show that a ReLU activated wide neural network at random initialization has a larger angle separation for similar data in the feature space of model gradient, and has a smaller condition number for NTK. Note that, for a linear neural network, the data separation and NTK condition number always remain the same as in the case of a linear model. Furthermore, we show that a deeper ReLU network (i.e., with more ReLU activation operations), has a smaller NTK condition number than a shallower one. Our results imply that ReLU activation, as well as the depth of ReLU network, helps improve the gradient descent convergence rate, which is closely related to the NTK condition number.
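The abstract's depth claim can also be probed in the infinite-width limit. The sketch below implements the standard layerwise NTK recursion for a fully connected ReLU network (He-scaled arc-cosine updates; this is the usual textbook recursion, not code from the paper) and evaluates the condition number at several depths on a small set of nearly parallel unit vectors.

```python
import numpy as np

def ntk_depth(C, depth):
    """Infinite-width NTK of a ReLU network with `depth` activation layers,
    via the standard recursion on cosine similarities (unit-norm inputs)."""
    sigma = C.copy()      # Sigma^0: input Gram matrix (unit diagonal)
    theta_ntk = C.copy()  # running NTK, Theta^0 = Sigma^0
    for _ in range(depth):
        ang = np.arccos(np.clip(sigma, -1.0, 1.0))
        sigma_dot = (np.pi - ang) / np.pi  # derivative kernel E[relu'(u) relu'(v)]
        # He-scaled activation kernel E[2 relu(u) relu(v)] (keeps unit diagonal)
        sigma = (np.sin(ang) + (np.pi - ang) * np.cos(ang)) / np.pi
        theta_ntk = theta_ntk * sigma_dot + sigma
    return theta_ntk

# nearly parallel unit vectors ("similar data"; illustrative choice)
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.1]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
C = X @ X.T

conds = [np.linalg.cond(ntk_depth(C, L)) for L in (1, 2, 4, 8)]
print(conds)  # condition number shrinks as depth grows
```

Each layer multiplies the accumulated off-diagonal NTK entries by a derivative kernel strictly below 1 (for any nonzero angle) while adding 1 to the diagonal, so the normalized off-diagonals decay with depth and the condition number decreases, matching the abstract's deeper-is-better-conditioned claim.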
Problem

Research questions and friction points this paper is trying to address.

Does ReLU activation improve feature separation of similar data in wide neural networks?
Does nonlinear activation improve NTK conditioning, and thereby gradient descent convergence?
Does network depth amplify these beneficial effects of nonlinear activation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First theoretical link between ReLU activation and improved NTK conditioning in wide networks
Geometric analysis showing ReLU enlarges the angular separation of similar samples in the gradient feature space
Proof that greater depth strictly reduces the NTK condition number, unlike depth-invariant linear networks