CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the significant accuracy degradation that low-precision Mixture-of-Experts (MoE) models suffer under post-training quantization, which is caused primarily by outliers in both activations and weights. The paper introduces CodeQuant, a unified framework that handles outlier mitigation and clustered quantization jointly: a learnable rotation smooths activation outliers, while weight outliers are absorbed into fine-tuned cluster centroids, reducing quantization error. Combined with dedicated GPU and CPU kernel designs, the method achieves up to a 4.15× speedup while delivering higher accuracy than state-of-the-art quantization approaches across diverse MoE models. This makes low-bit MoE models more practical to deploy while maintaining their expressive capacity.
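The rotation idea in the summary above rests on a simple identity: for an orthogonal matrix R, (XR)(RᵀW) = XW, so activations can be rotated without changing the layer output while a single outlier channel is spread across many channels. The sketch below illustrates only that identity. It is not the paper's method: the rotation here is a fixed random orthogonal matrix rather than a learned one, and every name, shape, and scale factor is a hypothetical choice made for the example.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Random orthogonal matrix from a QR decomposition.
    Assumption: stands in for a learned rotation; nothing is trained here."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def peak_to_rms(a):
    """Largest magnitude divided by the root-mean-square value."""
    return np.max(np.abs(a)) / np.sqrt(np.mean(a ** 2))

rng = np.random.default_rng(1)
dim = 64
x = rng.standard_normal((8, dim))    # hypothetical activations
x[:, 3] *= 50.0                      # inject one outlier channel
w = rng.standard_normal((dim, 16))   # hypothetical weight matrix

r = random_orthogonal(dim)
x_rot = x @ r      # rotated activations
w_rot = r.T @ w    # counter-rotated weights

# The layer output is preserved up to floating-point error because R @ R.T = I.
output_gap = np.max(np.abs(x @ w - x_rot @ w_rot))

# The rotation spreads the outlier channel over all channels, lowering the peak.
ratio_before = peak_to_rms(x)
ratio_after = peak_to_rms(x_rot)
print(f"output gap: {output_gap:.2e}")
print(f"peak/RMS before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

A lower peak-to-RMS ratio means a quantization grid sized to the largest value wastes fewer levels on the rest of the tensor, which is the motivation for rotating before quantizing.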

📝 Abstract

Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing CodeQuant, a unified quantization-and-clustering scheme for MoE that smooths activation outliers via learnable rotation and absorbs weight outliers into fine-tuned cluster centroids. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to 4.15× speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.
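To make concrete what it means for outliers to be fitted within cluster centroids, the sketch below compares a uniform min-max grid with a small k-means codebook on a synthetic weight vector containing a few extreme values. It is an illustration under stated assumptions, not CodeQuant itself: plain Lloyd iterations stand in for the paper's fine-tuned centroids, and the weight distribution, the 16-entry codebook, and all function names are hypothetical.

```python
import numpy as np

def uniform_quantize(w, n_levels):
    """Round each weight to the nearest point of an evenly spaced min-max grid."""
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (n_levels - 1)
    return lo + np.round((w - lo) / step) * step

def kmeans_codebook(w, n_levels, n_iters=25):
    """1-D k-means (Lloyd) codebook. Centroids start at data quantiles,
    so the minimum and maximum weights begin with their own entries."""
    centroids = np.quantile(w, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            members = w[codes == k]
            if members.size:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean()
    codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, codes

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)   # hypothetical bulk of the weights
w[:4] = [1.0, -1.2, 0.9, -0.8]         # a few injected outliers

n_levels = 16                          # 4-bit codes
uniform_mse = np.mean((w - uniform_quantize(w, n_levels)) ** 2)
centroids, codes = kmeans_codebook(w, n_levels)
kmeans_mse = np.mean((w - centroids[codes]) ** 2)
print(f"uniform MSE: {uniform_mse:.2e}, codebook MSE: {kmeans_mse:.2e}")
```

In this toy setup the uniform grid stretches to cover the outliers and leaves almost no levels for the dense bulk of weights, whereas the codebook gives the outliers their own centroids and spends the remaining entries where most weights lie.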

Problem

Research questions and friction points this paper is trying to address.

outliers

low-precision

Mixture-of-Experts

quantization error

post-training quantization

Innovation

Methods, ideas, or system contributions that make the work stand out.

CodeQuant

Mixture-of-Experts

outlier smoothing