CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing post-training quantization methods for large language models (LLMs) typically adopt uniform, layer-wise strategies, overlooking intrinsic differences in layer sensitivity to quantization algorithms. Method: We propose a fine-tuning-free, modular heterogeneous quantization framework that introduces Linear Centered Kernel Alignment (Linear CKA) as a layer-level metric to automatically select the optimal quantization algorithm per layer—departing from conventional uniform or bit-width–mixed paradigms. The framework enables parallel evaluation of multiple quantization algorithms and modular integration of heterogeneous strategies. Contribution/Results: Evaluated on mainstream LLMs including LLaMA and Qwen, our approach achieves significantly lower perplexity (PPL) and consistently outperforms both uniform quantization and state-of-the-art mixed-precision methods across diverse downstream tasks, demonstrating the efficacy of algorithm-diversity–driven heterogeneous compression.

📝 Abstract
Current mainstream post-training quantization (PTQ) methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA-Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmically heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated into a single hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods on mainstream LLMs, including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
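As a rough illustration of the similarity score the abstract relies on, Linear CKA between two activation matrices (samples × features) can be computed as below. This is a minimal NumPy sketch based on the standard Linear CKA formula; the function name and argument conventions are assumptions, not taken from the paper:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X, Y of shape (n_samples, d)."""
    # Center each feature column.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X) ** 2          # Frobenius norm by default
    den = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
    return num / den
```

Linear CKA is 1 for identical representations and is invariant to isotropic scaling and orthogonal transformations of the feature space, which makes it a convenient layer-level fidelity score for comparing quantized and full-precision activations.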
Problem

Research questions and friction points this paper is trying to address.

Optimizes quantization strategies per layer for LLMs
Selects best algorithm using CKA metric automatically
Improves model performance over uniform quantization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

CKA-guided modular quantization for algorithmic diversity
Layer-wise PTQ algorithm evaluation with CKA metric
Hybrid quantized model construction from optimized strategies
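The per-layer evaluation and selection described above can be sketched as a loop that quantizes a layer with each candidate PTQ algorithm and keeps the one whose outputs on calibration data score highest under Linear CKA against the full-precision outputs. The callable-based API and all names below are illustrative assumptions, not the paper's interface:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity (repeated here for self-containment)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X) ** 2
    den = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
    return num / den

def select_algo_for_layer(fp_forward, quantized_forwards, calib_x):
    """Pick the PTQ algorithm whose layer outputs best match full precision.

    fp_forward: full-precision layer as a callable on calibration inputs.
    quantized_forwards: dict mapping algorithm name -> quantized-layer callable.
    Returns the name of the algorithm with the highest Linear CKA score.
    """
    ref = fp_forward(calib_x)
    scores = {name: linear_cka(ref, f(calib_x))
              for name, f in quantized_forwards.items()}
    return max(scores, key=scores.get)
```

Running this independently per layer, then assembling the winning per-layer configurations, yields the kind of hybrid quantized model the summary describes; the parallel evaluation across algorithms is what makes the framework modular and fine-tuning-free.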