🤖 AI Summary
Existing LLM compression evaluations focus solely on language modeling and understanding tasks, overlooking the impact of compression on agent capabilities such as workflow generation, tool invocation, long-context reasoning, and real-world application performance. This paper introduces ACBench, the first benchmark explicitly designed to assess the effects of LLM compression on agent-level competencies, systematically evaluating four core agent task categories. The paper formalizes and quantifies the *non-uniformity* of compression effects, proposing three analytical metrics: ERank, Top-k Ranking Correlation, and Energy. Empirical evaluation spans 15 mainstream models (from Gemma-2B to Qwen2.5-32B, plus DeepSeek-R1-Distill), applying GPTQ/AWQ quantization and Wanda/SparseGPT pruning. Results show that 4-bit quantization degrades workflow and tool-use performance by only 1–3%, yet reduces real-world application accuracy by 10–15%. ACBench is publicly released to enable reproducible, evidence-based compression strategy selection.
📝 Abstract
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B–32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1–3% drop) but degrades real-world application accuracy by 10–15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found at https://github.com/pprp/ACBench.