Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of calibration data selection in post-training compression of large language models by proposing ZipCal, a model-agnostic data filtering method that leverages the Zipfian power-law distribution without relying on model-specific signals. By analyzing word frequency distributions, ZipCal constructs calibration sets with high lexical diversity at linear computational complexity. Experimental results demonstrate that ZipCal significantly outperforms random sampling across multiple pruning and quantization benchmarks, achieving performance comparable to state-of-the-art perplexity-based methods while reducing computational overhead by an average of approximately 240×.

📝 Abstract
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. Although several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called "calibration data") for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, its downstream performance is also on par with a state-of-the-art method that relies on model perplexity; the latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average ~240× faster owing to its tractable linear complexity. Code and experiments are available at https://anonymous.4open.science/r/zipcal-71CD/.
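The core idea — scoring documents by lexical diversity weighted toward rare words (which, under Zipf's law, carry most of the vocabulary's type diversity) and selecting the top-scoring ones in a single linear pass — can be sketched as follows. The scoring function, tokenization, and function names here are illustrative assumptions, not the paper's actual algorithm.

```python
from collections import Counter

def zipf_diversity_scores(corpus):
    """Score each document by its unique word types, weighted by inverse
    document frequency so that rare (Zipf-tail) words count more.
    Hypothetical sketch of a Zipf-based diversity score, not ZipCal itself."""
    doc_freq = Counter()
    tokenized = []
    for doc in corpus:
        tokens = doc.lower().split()
        tokenized.append(tokens)
        doc_freq.update(set(tokens))  # one count per document per type

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        # Sum of rarity weights over unique types, normalized by length.
        score = sum(1.0 / doc_freq[t] for t in set(tokens)) / len(tokens)
        scores.append(score)
    return scores

def select_calibration_set(corpus, k):
    """Return the k highest-scoring documents as the calibration set."""
    scores = zipf_diversity_scores(corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:k]]
```

Both passes over the corpus are linear in the total token count (the final sort is over documents, not tokens), which is consistent with the linear-complexity claim above, unlike perplexity-based selection that requires a forward pass of the model per candidate.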
Problem

Research questions and friction points this paper is trying to address.

data curation
model compression
calibration data
pruning
quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

model-agnostic
data curation
Zipfian distribution
post-training compression
calibration data