🤖 AI Summary
This work addresses the lack of a unified evaluation framework for large language model compression, an area where prior evaluations have predominantly focused on knowledge-intensive tasks while neglecting critical capabilities such as reasoning, multilingual performance, and instruction following. To bridge this gap, we propose UniComp, a comprehensive benchmark that systematically evaluates three mainstream families of compression techniques—pruning, quantization, and knowledge distillation—across more than 40 diverse benchmarks along three dimensions: performance, reliability, and efficiency. Our analysis reveals capability shifts induced by compression and introduces task-specific calibration strategies that improve the reasoning performance of pruned models by up to 50%. Experimental results show that quantization achieves the best trade-off between performance and efficiency, whereas distillation delivers significant inference speedup at the cost of higher computational overhead.
📝 Abstract
Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. To address this gap, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions—performance, reliability, and efficiency—using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias: knowledge-intensive tasks are relatively well preserved, while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration at high computational cost; and (iii) task-specific calibration can improve the reasoning ability of pruned models by up to 50%.