🤖 AI Summary
This work addresses a practical bottleneck in multi-objective optimization (MOO): neural networks (e.g., LSTMs, Transformers) commonly violate the standard L-smoothness or bounded-gradient assumptions. We study MOO under generalized ℓ-smooth losses. First, we establish the first complete convergence theory for the Multiple Gradient Descent Algorithm (MGDA) and its stochastic variant under this generalized smoothness: deterministic MGDA converges to an ε-accurate Pareto stationary point with sample complexity O(ε⁻²), while stochastic MGDA achieves O(ε⁻⁴), both guaranteeing an ε-level conflict-avoidant (CA) direction error on average over all iterations. Second, we analyze MGDA-FA, a variant that retains these theoretical guarantees with only O(1) time and space overhead. Key technical ingredients include generalized-smoothness analysis, multi-objective gradient synthesis with CA-direction approximation, double-sampling stochastic optimization, and a precise characterization of Pareto stationarity.
📝 Abstract
Multi-objective optimization (MOO) has received increasing attention in various fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but they are limited by the standard $L$-smoothness or bounded-gradient assumptions, which typically do not hold for neural networks such as long short-term memory (LSTM) models and Transformers. In this paper, we study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We revisit and analyze the fundamental multiple gradient descent algorithm (MGDA) and its stochastic version with double sampling for solving generalized $\ell$-smooth MOO problems, both of which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of these algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in total are needed for the deterministic and stochastic settings, respectively. We further prove that they can guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we analyze an efficient variant of MGDA named MGDA-FA that uses only $\mathcal{O}(1)$ time and space, while achieving the same performance guarantee as MGDA.
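To make the CA direction concrete: in MGDA, the update direction is the negative of the minimum-norm vector in the convex hull of the per-objective gradients, i.e., one solves $\min_{\lambda \in \Delta^m} \|\sum_i \lambda_i g_i\|^2$ over the simplex. The sketch below (not the paper's code; function names and the Frank-Wolfe inner solver are illustrative choices) computes this direction for a stack of gradients:

```python
import numpy as np

def mgda_direction(grads, n_iters=200):
    """Approximate MGDA's conflict-avoidant (CA) direction.

    grads: (m, d) array, one gradient per objective.
    Solves min_{lam in simplex} ||grads.T @ lam||^2 with Frank-Wolfe
    (an illustrative choice of inner solver), then returns the descent
    direction d = -grads.T @ lam and the simplex weights lam.
    """
    m = grads.shape[0]
    G = grads @ grads.T            # (m, m) Gram matrix of the gradients
    lam = np.ones(m) / m           # start from uniform weights on the simplex
    for k in range(n_iters):
        grad_lam = G @ lam         # gradient of the quadratic in lam (up to a factor 2)
        i = int(np.argmin(grad_lam))   # Frank-Wolfe: best simplex vertex
        step = 2.0 / (k + 2)       # standard diminishing FW step size
        vertex = np.zeros(m)
        vertex[i] = 1.0
        lam = (1 - step) * lam + step * vertex
    d = -grads.T @ lam             # CA direction: negative min-norm convex combination
    return d, lam
```

For two orthogonal, equally sized gradients the weights converge to $(1/2, 1/2)$, and the returned direction has negative inner product with every objective's gradient, i.e., it is a common descent direction, which is the "minimum improvement among objectives" property the abstract refers to.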