🤖 AI Summary
This work addresses optimization instability in autoregressive training of multimodal large language models, which arises from gradient heterogeneity across modalities and limits scaling to large batch sizes. The authors propose ML-FOP-SOAP, a framework that builds on the second-order preconditioning method SOAP for multimodal training. It combines a Fisher-Orthogonal Projection, which suppresses variance-induced conflicts between modalities, with Multi-Level Variance Correction and a hierarchical folding strategy that captures fine-grained variance under large gradient accumulation at low micro-step overhead. Evaluated on Janus and Emu3, ML-FOP-SOAP yields consistent gains in both visual generation and textual understanding and trains stably at a batch size of 8192. Compared with AdamW, it improves sample efficiency by up to 1.4× and speeds up wall-clock training by up to 1.5×.
📝 Abstract
Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to 1.4× and accelerates wall-clock training by up to 1.5×, offering a robust optimizer for scaling multimodal foundation models.
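The abstract does not spell out how the Fisher-Orthogonal Projection is computed, so the following is only a minimal sketch of the generic idea of orthogonality under a Fisher inner product, not the authors' algorithm. It assumes a diagonal Fisher approximation and two per-modality gradients; the function names, the mean/difference split, and the toy values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def fisher_inner(u: Sequence[float], v: Sequence[float],
                 fisher_diag: Sequence[float]) -> float:
    """Inner product <u, v>_F = sum_i F_ii * u_i * v_i (diagonal Fisher assumed)."""
    return sum(f * a * b for f, a, b in zip(fisher_diag, u, v))


def fisher_orthogonal_component(d: Sequence[float], g: Sequence[float],
                                fisher_diag: Sequence[float],
                                eps: float = 1e-12) -> List[float]:
    """Remove from d its component along g, measured in the Fisher metric.

    The result d_perp satisfies <d_perp, g>_F = 0 (up to rounding).
    """
    denom = fisher_inner(g, g, fisher_diag)
    if denom <= eps:
        # g is (numerically) zero in this metric: nothing to project out.
        return list(d)
    coeff = fisher_inner(d, g, fisher_diag) / denom
    return [di - coeff * gi for di, gi in zip(d, g)]


# Toy, made-up gradients for two modalities and a made-up diagonal Fisher.
g_text = [1.0, 0.0, 2.0]
g_image = [0.5, 1.0, -1.0]
fisher_diag = [1.0, 2.0, 0.5]

# Assumption: split into a shared (mean) direction and a disagreement direction.
g_mean = [(a + b) / 2 for a, b in zip(g_text, g_image)]
g_diff = [(a - b) / 2 for a, b in zip(g_text, g_image)]

# Keep only the part of the disagreement that is Fisher-orthogonal to the mean.
d_perp = fisher_orthogonal_component(g_diff, g_mean, fisher_diag)
residual = fisher_inner(d_perp, g_mean, fisher_diag)
print("Fisher inner product after projection:", residual)
```

In this sketch the projected disagreement term carries no component along the shared direction in the chosen metric, which is one way a variance-related correction can avoid competing with the mean update. How ML-FOP-SOAP weights or combines such terms, and which Fisher approximation it uses, is not stated in the abstract.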