🤖 AI Summary
This work addresses the lack of a unified evaluation framework for large language model compression, an area where prior evaluations have predominantly focused on knowledge-intensive tasks while neglecting critical capabilities such as reasoning, multilingual performance, and instruction following. To bridge this gap, we propose UniComp, a comprehensive benchmark that systematically evaluates three mainstream families of compression techniques—pruning, quantization, and knowledge distillation—across more than 40 diverse benchmarks along three dimensions: performance, reliability, and efficiency. Our analysis reveals capability shifts induced by compression and introduces task-specific calibration strategies that improve the reasoning performance of pruned models by up to 50%. Experimental results show that quantization achieves the best trade-off between performance and efficiency, whereas distillation delivers significant inference speedup at the cost of higher computational overhead.
📝 Abstract
Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. To address this gap, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions—performance, reliability, and efficiency—using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias: knowledge-intensive tasks are relatively well preserved, while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration at high computational cost; and (iii) task-specific calibration can improve the reasoning ability of pruned models by up to 50%.