NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

πŸ“… 2026-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of compressing large language models, which is constrained by memory limitations and deployment costs. It reveals for the first time that models trained with the Muon optimizer inherently exhibit strong low-rank characteristics. Building on this insight, the authors introduce a nuclear-norm constraint to explicitly steer weight updates toward low-rank structures during training. The proposed method integrates seamlessly with mainstream compression pipelines and significantly improves post-compression performance on billion-parameter-scale models, while preserving Muon's original advantage of rapid convergence.

πŸ“ Abstract
The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not yet been characterized. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further steering the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.
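To make the idea concrete, here is a minimal sketch of one plausible reading of the abstract: Muon's characteristic step orthogonalizes the gradient (replacing it with the nearest semi-orthogonal matrix), and a nuclear-norm constraint can be approximated by a proximal shrinkage of the singular values after the update. The function names (`muon_orthogonalize`, `nuclear_prox`, `numuon_step`) and the shrinkage threshold `tau` are illustrative assumptions, not the paper's actual algorithm, which the abstract does not specify.

```python
import numpy as np

def muon_orthogonalize(G):
    # Muon-style step: replace the gradient G = U S V^T with U V^T,
    # its nearest (semi-)orthogonal matrix. Muon itself uses a
    # Newton-Schulz iteration; an exact SVD is used here for clarity.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nuclear_prox(W, tau):
    # Proximal operator of tau * ||W||_* (nuclear norm):
    # soft-threshold the singular values, zeroing out small ones,
    # which biases the weights toward low rank.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def numuon_step(W, G, lr=0.02, tau=1e-3):
    # Hypothetical NuMuon-style update: Muon's orthogonalized step
    # followed by a nuclear-norm proximal shrinkage of the weights.
    W = W - lr * muon_orthogonalize(G)
    return nuclear_prox(W, tau)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
G = rng.standard_normal((8, 8))
W_new = numuon_step(W, G)
print(W_new.shape)  # (8, 8)
```

The shrinkage step can only reduce (never increase) the nuclear norm of the iterate, which is the sense in which such a constraint "steers" training toward compressible, low-rank weights.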
Problem

Research questions and friction points this paper is trying to address.

large language models
model compression
low-rank structure
optimizer
nuclear norm
Innovation

Methods, ideas, or system contributions that make the work stand out.

nuclear-norm constraint
low-rank structure
compressible LLM training
Muon optimizer
weight compressibility