Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing domain generalization and multi-task learning methods, which typically optimize either loss-landscape flatness or gradient alignment but not both. The study shows theoretically that flatness and gradient alignment make distinct, non-interchangeable contributions to excess risk. Building on this insight, the authors propose SAGE, an algorithm that uses Newton–Schulz iterations to compute the polar factor of each layer's gradient matrix, so that the ascent step probes all directions with similar magnitude. SAGE also injects isotropic noise at the descent step, scaled by cross-distribution gradient disagreement, so that flatness and gradient alignment are targeted jointly. Experiments on five domain generalization benchmarks and two multi-task learning benchmarks show that SAGE sets a new state of the art on DomainBed and improves base MTL solvers, remaining competitive with or surpassing state-of-the-art methods.

📝 Abstract

Sharpness-aware and gradient-alignment methods have been shown to improve generalization; however, each family of methods targets a single geometric property of the loss landscape while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}\Sigma_g$, and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $\Sigma_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in the first term and non-inverted in the second. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration), which targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. The alignment component injects isotropic noise at the descent step, with a magnitude that scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
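
The two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`newton_schulz_polar`, `gradient_disagreement`, `noisy_descent_step`), the classical cubic Newton-Schulz update, the Frobenius normalization, the choice of disagreement measure (square root of the trace of the per-distribution gradient covariance), and the parameters `lr` and `noise_coef` are all assumptions made for illustration; the paper may use different coefficients, scalings, and schedules.

```python
import numpy as np


def newton_schulz_polar(grad, num_iters=20):
    """Approximate the polar factor U V^T of a 2-D gradient matrix.

    Assumed variant: the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, which converges when every singular
    value of the starting point lies in (0, sqrt(3)).
    """
    # Frobenius normalization puts every singular value in (0, 1].
    x = grad / (np.linalg.norm(grad) + 1e-12)
    for _ in range(num_iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def gradient_disagreement(per_dist_grads):
    """Hypothetical disagreement measure: sqrt(trace(Sigma_g)), where
    Sigma_g is the covariance of the flattened per-distribution gradients."""
    g = np.stack([np.ravel(gi) for gi in per_dist_grads])
    centered = g - g.mean(axis=0, keepdims=True)
    # Sum of per-coordinate variances equals the trace of the covariance.
    return float(np.sqrt((centered ** 2).sum() / g.shape[0]))


def noisy_descent_step(params, grad, per_dist_grads,
                       lr=0.1, noise_coef=0.01, rng=None):
    """Gradient step plus isotropic Gaussian noise whose scale grows
    with the disagreement between per-distribution gradients."""
    rng = np.random.default_rng() if rng is None else rng
    scale = noise_coef * gradient_disagreement(per_dist_grads)
    noise = scale * rng.standard_normal(params.shape)
    return params - lr * grad + noise
```

In this sketch the polar factor has all singular values equal to one, which is what lets an ascent step along it probe every direction with similar magnitude. When all per-distribution gradients agree, the disagreement is zero and `noisy_descent_step` reduces to a plain gradient step.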

Problem

Research questions and friction points this paper is trying to address.

flatness

gradient alignment

multi-distribution learning

excess risk

loss landscape

Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient alignment

loss landscape flatness

multi-distribution learning