🤖 AI Summary
This work addresses a limitation of existing optimization-based multi-task learning (MTL) approaches: when they are paired with advanced optimizers such as Muon, the instantaneous gradients contribute only marginally to the actual parameter updates, so the MTL method has limited influence on the learning dynamics. The study also finds that Muon itself behaves as an implicit multi-task learner, which makes the gradients fed into its orthogonalization step especially important. To address both points, the authors propose APT (Applicability of advanced oPTimizers), a framework that uses a simple adaptive momentum mechanism to balance the strengths of the optimizer and the MTL method, together with a lightweight direction-preservation method that facilitates Muon's orthogonalization. Experiments on four mainstream MTL datasets show that APT consistently augments existing MTL approaches and yields substantial performance improvements.
📝 Abstract
Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically identify that their effectiveness is often undermined by an overlooked factor when employing advanced optimizers: the instantaneous gradients, those computed at the current step, play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully exerting their influence on the learning dynamics. Furthermore, we observe that Muon, a recently emerged advanced optimizer, inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths of advanced optimizers and MTL. Additionally, we introduce a lightweight direction-preservation method to facilitate Muon's orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.
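The claim that the current-step gradient plays only a marginal role can be illustrated with a generic momentum buffer. The sketch below is not the paper's formulation: it assumes a standard exponential-moving-average update `m_t = beta * m_{t-1} + (1 - beta) * g_t`, and the value `beta = 0.9` and the helper names `ema_weights` and `ema_buffer` are hypothetical choices made only for illustration.

```python
def ema_weights(beta, steps):
    """Weight that each gradient g_1..g_steps receives in the buffer m_steps,
    assuming m_t = beta * m_{t-1} + (1 - beta) * g_t with m_0 = 0 (oldest first)."""
    return [(1.0 - beta) * beta ** (steps - 1 - k) for k in range(steps)]


def ema_buffer(grads, beta):
    """Run the same EMA update step by step over a list of scalar gradients."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


beta = 0.9  # hypothetical momentum coefficient, not taken from the paper
weights = ema_weights(beta, steps=50)

# Fraction of the momentum buffer that comes from the newest gradient alone.
latest_share = weights[-1] / sum(weights)
print(f"share of newest gradient: {latest_share:.3f}")
```

Under these assumptions the newest gradient accounts for roughly a tenth of the buffer, and the rest is history. Any per-step de-conflicting or re-balancing applied to that gradient is diluted accordingly, which is the kind of imbalance an adaptive momentum mechanism is meant to address.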