🤖 AI Summary
This work addresses complementary limitations of steepest descent and conditional gradient (Frank–Wolfe) methods in non-Euclidean optimization. The proposed framework unifies their strengths through three key contributions: (1) a generalized $(L_0, L_1)$-smoothness condition for non-Euclidean norms, under which the local smoothness constant may grow with the gradient norm; (2) a generalized gradient norm clipping mechanism that guarantees descent under this smoothness condition; and (3) a principled integration of weight decay via a connection to the Frank–Wolfe short step, combined with a momentum-based gradient estimator that achieves the order-optimal $O(n^{-1/4})$ convergence rate in the stochastic setting. Empirically, the method improves training stability and generalization on image classification and language modeling benchmarks.
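To make the clipping idea concrete, here is a minimal sketch of a clipped steepest-descent update in the Euclidean ($\ell_2$) special case. The parameter names `eta` (base step size) and `tau` (clipping threshold) are illustrative assumptions, not the paper's notation; the actual method operates with general non-Euclidean norms and their linear minimization oracles.

```python
import numpy as np

def clipped_steepest_descent_step(x, grad, eta, tau):
    """One illustrative clipped-descent update (l2 special case; hypothetical
    parameters eta and tau, not the paper's exact algorithm).

    The effective step length grows linearly with the gradient norm while it
    is small, but is capped at eta * tau once the gradient norm exceeds tau,
    which is what stabilizes training under (L0, L1)-smoothness, where the
    local curvature can grow with the gradient norm.
    """
    g_norm = np.linalg.norm(grad)
    if g_norm == 0.0:
        return x
    step = eta * min(g_norm, tau)   # clipped step length
    direction = -grad / g_norm      # unit steepest-descent direction (l2 case)
    return x + step * direction
```

With `eta=0.1, tau=1.0`, a gradient of norm 3 moves the iterate only `0.1` (clipped), while a gradient of norm 0.5 moves it `0.05` (unclipped).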
📝 Abstract
This work introduces a hybrid non-Euclidean optimization method that generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of $(L_0, L_1)$-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank–Wolfe short step. In the stochastic case, we show an order-optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum-based gradient estimator. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
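The connection between the Frank–Wolfe short step and weight decay can be sketched as follows, again in the $\ell_2$ special case with illustrative parameter names (`beta`, `gamma`, `radius` are assumptions for this sketch, not the paper's notation): because the iterate is a convex combination $(1-\gamma)x + \gamma s$ with $s$ drawn from a norm ball, the $(1-\gamma)$ factor shrinks the weights each step, which is exactly a weight-decay term.

```python
import numpy as np

def frank_wolfe_momentum_step(x, grad, m, beta=0.1, gamma=0.01, radius=1.0):
    """Illustrative Frank-Wolfe-style step on an l2 norm ball (a sketch,
    assuming hypothetical hyperparameters beta, gamma, radius).

    The buffer m is a momentum-based gradient estimate that averages
    stochastic gradients; the convex-combination update multiplies x by
    (1 - gamma), so weight decay falls out of the short-step rule itself.
    """
    m = (1.0 - beta) * m + beta * grad  # momentum gradient estimate
    m_norm = np.linalg.norm(m)
    # Linear minimization oracle over the l2 ball of the given radius:
    # argmin_{||s|| <= radius} <m, s> = -radius * m / ||m||.
    s = -radius * m / m_norm if m_norm > 0 else np.zeros_like(x)
    x = (1.0 - gamma) * x + gamma * s   # short step: (1 - gamma) acts as decay
    return x, m
```

Starting from `x = [1, 1]` with `m = 0`, `grad = [1, 0]`, and the defaults above, one step yields `m = [0.1, 0]` and `x = [0.98, 0.99]`: every coordinate is shrunk by the factor `1 - gamma`, and the update then moves along the oracle direction.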