A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks

📅 2025-10-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address optimization difficulties arising from the non-convexity of loss functions in deep neural network training, this paper proposes a two-phase optimization algorithm. In the first phase, a method suited to non-convex regions (Adam) moves through the initial non-convex region; in the second phase, the algorithm switches to the Conjugate Gradient (CG) method, which the paper treats as a second-order method, once it detects a transition to a convex region. The swap point is detected by observing how the gradient norm depends on the loss. The underlying hypothesis is that loss functions in real-world tasks move from initial non-convexity to convexity towards the optimum. The reported computing experiments support the hypothesis that this convexity structure is frequent enough to be exploited in practice to substantially improve convergence and accuracy.

📝 Abstract

The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
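The link between the gradient norm and the loss near a minimum can be made concrete with a standard second-order Taylor argument. This is a textbook derivation offered as background, not the paper's own detection rule, and it assumes that the Hessian $H$ at the local minimum $w^*$ is positive definite (the case in which the loss is convex in a neighbourhood of $w^*$). The symbols $w^*$, $H$, $\lambda_{\min}$ and $\lambda_{\max}$ (the smallest and largest eigenvalues of $H$) are introduced here for illustration.

```latex
\begin{align}
L(w) &\approx L(w^*) + \tfrac{1}{2}\,(w - w^*)^\top H\,(w - w^*), \\
\nabla L(w) &\approx H\,(w - w^*), \\
2\lambda_{\min}\bigl(L(w) - L(w^*)\bigr)
  &\le \|\nabla L(w)\|^2
  \le 2\lambda_{\max}\bigl(L(w) - L(w^*)\bigr).
\end{align}
```

Within this quadratic model, the squared gradient norm shrinks in proportion to the excess loss $L(w) - L(w^*)$. A gradient norm that falls together with the loss is therefore consistent with a convex region, whereas in a non-convex region the gradient norm can grow while the loss falls.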

Problem

Research questions and friction points this paper is trying to address.

Proposes a two-phase training algorithm for deep neural networks

Detects transition from non-convex to convex loss function regions

Combines Adam and Conjugate Gradient methods for improved convergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase training algorithm for deep neural networks

Detects convexity swap point via gradient norm analysis

Switches between Adam and Conjugate Gradient methods
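The points above can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: the toy loss, the switching rule (loss and gradient norm both falling for `patience` consecutive steps), and all function names and parameter values are assumptions made here. The abstract only states that the swap point is detected from the dependence of the gradient norm on the loss, so the actual criterion may differ.

```python
import numpy as np

def loss_and_grad(w):
    # Toy loss (assumption): sum(log(1 + w_i^2)) is non-convex for |w_i| > 1
    # and convex for |w_i| < 1, with its only minimum at w = 0.
    return np.sum(np.log1p(w ** 2)), 2.0 * w / (1.0 + w ** 2)

def adam_phase(w, lr=0.1, max_steps=500, patience=5, b1=0.9, b2=0.999, eps=1e-8):
    # Phase 1: plain Adam. Hypothetical switching rule: stop once the loss and
    # the gradient norm have both decreased for `patience` consecutive steps.
    m, v, streak = np.zeros_like(w), np.zeros_like(w), 0
    f, g = loss_and_grad(w)
    for step in range(1, max_steps + 1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        w = w - lr * (m / (1 - b1 ** step)) / (np.sqrt(v / (1 - b2 ** step)) + eps)
        f_new, g_new = loss_and_grad(w)
        falling = f_new < f and np.linalg.norm(g_new) < np.linalg.norm(g)
        streak = streak + 1 if falling else 0
        f, g = f_new, g_new
        if streak >= patience:
            return w, step
    return w, max_steps

def cg_phase(w, steps=100, tol=1e-8, h=1e-6):
    # Phase 2: nonlinear conjugate gradient (Polak-Ribiere+ with restarts).
    f, g = loss_and_grad(w)
    d = -g
    for _ in range(steps):
        if np.linalg.norm(g) < tol:
            break
        if g @ d >= 0:  # not a descent direction: restart with steepest descent
            d = -g
        # Finite-difference estimate of the curvature d^T H d along d.
        curv = d @ (loss_and_grad(w + h * d)[1] - g) / h
        t = -(g @ d) / curv if curv > 0 else 1.0
        # Armijo backtracking as a safeguard when the quadratic model is poor.
        f_new, g_new = loss_and_grad(w + t * d)
        while f_new > f + 1e-4 * t * (g @ d) and t > 1e-12:
            t *= 0.5
            f_new, g_new = loss_and_grad(w + t * d)
        w = w + t * d
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        f, g = f_new, g_new
    return w

def two_phase_minimize(w0):
    # Run Adam until the switching rule fires (or the step budget runs out),
    # then hand the iterate over to conjugate gradient.
    w, switch_step = adam_phase(np.asarray(w0, dtype=float))
    return cg_phase(w), switch_step

if __name__ == "__main__":
    w_final, switch_step = two_phase_minimize([3.0, -2.5])
    print("switched after", switch_step, "Adam steps; final w =", w_final)
```

The rule used here is deliberately crude: on this toy loss the gradient norm rises while the loss falls in the non-convex region and both fall in the convex one, but nothing guarantees that the same holds for a real network. The CG phase therefore keeps a descent-direction restart and a backtracking line search so that it still makes progress if the switch fires too early.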

🔎 Similar Papers

Extended convexity and smoothness and their applications in deep learning