QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts

📅 2024-06-12
📈 Citations: 2
Influential: 1
🤖 AI Summary
Applying a single fixed precision during post-training quantization of Mixture-of-Experts (MoE) models yields suboptimal performance because sensitivity varies across model substructures. Method: This paper proposes a structure-aware, fine-grained mixed-precision quantization method. Departing from conventional uniform quantization, it systematically characterizes the distinct bit-width requirements of MoE subcomponents (e.g., linear layers and MoE blocks), leveraging the empirical observation that shared experts exhibit high activation frequency while routed experts are activated selectively. Building on this, the authors design an outlier-aware linear-layer scorer and an MoE-block importance predictor, enabling data-driven, layer-wise precision allocation. Results: Evaluated on two representative MoE architectures across six NLU and commonsense reasoning benchmarks, the method achieves an average accuracy of 65.35%, outperforming the GPTQ baseline (64.30%) and setting a new state of the art for quantized MoE performance.
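To make the core idea concrete, here is a minimal sketch (not the paper's code) of structure-aware bit allocation driven by routing statistics: shared or frequently activated experts keep higher precision, rarely activated experts are compressed harder. All names, thresholds, and the scoring heuristic are assumptions made for illustration.

```python
# Illustrative sketch: per-expert bit-width allocation from routing statistics.
# The threshold-based rule is an assumption, not the paper's exact method.
from dataclasses import dataclass


@dataclass
class ExpertStats:
    name: str
    activation_freq: float  # fraction of tokens routed to this expert
    is_shared: bool         # shared experts are accessed by every token


def allocate_bits(experts, high_bits=4, low_bits=2, freq_threshold=0.25):
    """Give shared / frequently used experts more bits, sparse ones fewer."""
    plan = {}
    for e in experts:
        if e.is_shared or e.activation_freq >= freq_threshold:
            plan[e.name] = high_bits  # consistently activated -> keep precision
        else:
            plan[e.name] = low_bits   # selectively activated -> compress harder
    return plan


if __name__ == "__main__":
    experts = [
        ExpertStats("shared_expert", 1.00, True),
        ExpertStats("expert_0", 0.31, False),
        ExpertStats("expert_1", 0.06, False),
        ExpertStats("expert_2", 0.12, False),
    ]
    print(allocate_bits(experts))
    # e.g. {'shared_expert': 4, 'expert_0': 4, 'expert_1': 2, 'expert_2': 2}
```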

📝 Abstract
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. It increases the number of parameters while keeping FLOPs nearly constant during inference through sparse activation. Yet it still suffers from significant memory overhead due to its vast parameter count, necessitating model compression techniques. Post-training quantization offers a powerful approach for model compression. Existing methods adopt a fixed quantization precision for the entire MoE model. This rigid setup ignores the inherent sparse structure and can lead to suboptimal performance. For example, MoE's sparse routing mechanism produces different activation patterns: shared experts are accessed by all tokens, while token-conditioned experts are selectively activated. This activation disparity suggests different quantization requirements, with consistently activated shared experts potentially needing higher precision to maintain model quality. In this paper, we study a fine-grained precision setup for MoE quantization. We explore MoE structure-aware quantization heuristics, ranging from coarse (e.g., MoE layers) to fine granularity (e.g., linear layers). Our investigations reveal critical principles: different MoE structures require different numbers of bits for effective quantization. These conclusions are supported by extensive benchmarking across two representative MoE models and six tasks, including commonsense reasoning and natural language understanding. We further show that an MoE quantized with fine-grained mixed precision achieves a state-of-the-art 65.35% performance on average, compared to the 64.30% baseline (i.e., GPTQ). Moreover, based on these findings, we introduce novel data-driven techniques for optimizing bit allocation in MoE quantization, including an outlier-aware linear layer scorer and an MoE block importance predictor.
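The abstract's "outlier-aware linear layer scorer" can be illustrated with a short sketch. The statistic below (max-to-mean ratio of per-channel weight magnitudes) and the top-fraction allocation rule are assumptions chosen for illustration, not the paper's actual scorer; the intent is only to show how layer-wise scores can drive bit allocation.

```python
# Minimal sketch of an outlier-aware layer scorer, assuming outlier-heavy
# weight distributions are more sensitive to low-bit quantization.
import numpy as np


def outlier_score(weight: np.ndarray) -> float:
    """Higher score = heavier outliers = allocate more bits to this layer."""
    channel_mag = np.abs(weight).max(axis=0)  # column-wise peak magnitude
    return float(channel_mag.max() / (channel_mag.mean() + 1e-8))


def assign_layer_bits(layer_weights, high_bits=4, low_bits=2, top_frac=0.5):
    """Rank layers by outlier score; the top fraction keeps higher precision."""
    scores = {name: outlier_score(w) for name, w in layer_weights.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_high = max(1, int(len(ranked) * top_frac))
    return {name: (high_bits if i < n_high else low_bits)
            for i, name in enumerate(ranked)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = {f"layer_{i}.proj": rng.normal(size=(64, 64)) for i in range(4)}
    layers["layer_0.proj"][:, 0] *= 20.0  # inject an outlier channel
    print(assign_layer_bits(layers))
```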
Problem

Research questions and friction points this paper is trying to address.

Optimizing quantization precision for Mixture-of-Experts models
Addressing memory overheads in large language models
Developing structure-aware quantization heuristics for MoE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained MoE quantization precision
Structure-aware quantization heuristics
Data-driven bit allocation techniques