Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM compression evaluations focus solely on language modeling and understanding tasks, overlooking the impact of compression on agentic capabilities such as workflow generation, tool invocation, long-context reasoning, and real-world application performance. This paper introduces ACBench, the first benchmark explicitly designed to assess the effects of LLM compression on agent-level competencies, systematically evaluating four core agent task categories. To formalize and quantify the *non-uniformity* of compression effects, the authors propose three analytical metrics: ERank, Top-k Ranking Correlation, and Energy. The empirical evaluation spans 15 mainstream models (from Gemma-2B to Qwen2.5-32B, plus DeepSeek-R1-Distill), applying GPTQ/AWQ quantization and Wanda/SparseGPT pruning. Results show that 4-bit quantization degrades workflow-generation and tool-use performance by only 1–3%, yet reduces real-world application accuracy by 10–15%. ACBench is publicly released to enable reproducible, evidence-based selection of compression strategies.
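The summary names three analytical metrics without defining them, and this page does not reproduce the paper's formulations. Purely as an illustration of the kind of quantities involved: "effective rank" is commonly defined as the exponential of the Shannon entropy of the normalized singular-value distribution, and a top-k ranking correlation can be computed as Kendall's tau restricted to the k top-ranked items. Both functions below are hypothetical stand-ins; the paper's ERank and Top-k Ranking Correlation may differ in detail.

```python
import numpy as np


def effective_rank(matrix: np.ndarray) -> float:
    """Effective rank: exp of the entropy of the normalized
    singular-value distribution. Illustrative only; the paper's
    ERank metric may be defined differently."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()          # normalize singular values to a distribution
    p = p[p > 0]             # drop zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))


def topk_kendall_tau(scores_a, scores_b, k: int) -> float:
    """Kendall's tau over the items that scores_a ranks in its top k.
    A hypothetical stand-in for a top-k ranking-correlation metric."""
    top = np.argsort(scores_a)[::-1][:k]          # indices of top-k items
    a = np.asarray(scores_a)[top]
    b = np.asarray(scores_b)[top]
    concordant = discordant = 0
    for i in range(k):
        for j in range(i + 1, k):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (k * (k - 1) / 2)
```

For instance, a full-rank identity matrix has an effective rank equal to its dimension, and two score vectors that order the top-k items identically yield a tau of 1.0.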

๐Ÿ“ Abstract
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found at https://github.com/pprp/ACBench.
Problem

Research questions and friction points this paper is trying to address.

How does post-training compression affect LLMs' agentic capabilities (workflow generation, tool use, long-context understanding, real-world application)?
Existing compression benchmarks cover only language modeling and natural language understanding, leaving agentic tasks untested
The tradeoffs that quantization and pruning impose on agentic tasks are unquantified
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Agent Compression Benchmark (ACBench), spanning 12 tasks across 4 capabilities
Evaluates quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT) across 15 models
Proposes ERank, Top-k Ranking Correlation, and Energy for systematic analysis of compression effects