Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM compression evaluations focus solely on language modeling and understanding tasks, overlooking the impact of compression on agentic capabilities such as workflow generation, tool invocation, long-context reasoning, and real-world application performance. This paper introduces ACBench, the first benchmark explicitly designed to assess the effects of LLM compression on agent-level competencies, systematically evaluating four core agent task categories. To formalize and quantify the *non-uniformity* of compression effects, the authors propose three analytical metrics: ERank, Top-k Ranking Correlation, and Energy. The empirical evaluation spans 15 mainstream models (from Gemma-2B to Qwen2.5-32B, plus DeepSeek-R1-Distill), applying GPTQ/AWQ quantization and Wanda/SparseGPT pruning. Results show that 4-bit quantization degrades workflow-generation and tool-use performance by only 1–3%, yet reduces real-world application accuracy by 10–15%. ACBench is publicly released to enable reproducible, evidence-based selection of compression strategies.
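The summary names three analytical metrics without defining them, and this page does not reproduce the paper's formulations. Purely as an illustration of the kind of quantities involved: "effective rank" is commonly defined as the exponential of the Shannon entropy of the normalized singular-value distribution, and a top-k ranking correlation can be computed as Kendall's tau restricted to the k top-ranked items. Both functions below are hypothetical stand-ins; the paper's ERank and Top-k Ranking Correlation may differ in detail.

```python
import numpy as np


def effective_rank(matrix: np.ndarray) -> float:
    """Effective rank: exp of the entropy of the normalized
    singular-value distribution. Illustrative only; the paper's
    ERank metric may be defined differently."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()          # normalize singular values to a distribution
    p = p[p > 0]             # drop zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))


def topk_kendall_tau(scores_a, scores_b, k: int) -> float:
    """Kendall's tau over the items that scores_a ranks in its top k.
    A hypothetical stand-in for a top-k ranking-correlation metric."""
    top = np.argsort(scores_a)[::-1][:k]          # indices of top-k items
    a = np.asarray(scores_a)[top]
    b = np.asarray(scores_b)[top]
    concordant = discordant = 0
    for i in range(k):
        for j in range(i + 1, k):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (k * (k - 1) / 2)
```

For instance, a full-rank identity matrix has an effective rank equal to its dimension, and two score vectors that order the top-k items identically yield a tau of 1.0.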

๐Ÿ“ Abstract
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found at https://github.com/pprp/ACBench.
Problem

Research questions and friction points this paper is trying to address.

How does post-training compression affect LLMs' agentic capabilities (workflow generation, tool use, long-context understanding, real-world application)?
Existing compression benchmarks cover only language modeling and natural language understanding, leaving agentic tasks untested
The tradeoffs that quantization and pruning impose on agentic tasks are unquantified
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Agent Compression Benchmark (ACBench), spanning 12 tasks across 4 capabilities
Evaluates quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT) across 15 models
Proposes ERank, Top-k Ranking Correlation, and Energy for systematic analysis of compression effects