🤖 AI Summary
Existing LLM compression methods overemphasize perplexity or accuracy on simple tasks, neglecting higher-order capabilities such as retrieval-augmented generation (RAG), multi-step reasoning, external tool invocation, and computational expressivity. To address this, we propose the "Lottery LLM Hypothesis": for a given LLM and task, there exists a smaller lottery LLM that, with the assistance of multi-step reasoning and external tools, can match the original model's performance on complex tasks. Based on a review of recent progress in these areas, we discuss and summarize the essential capabilities that a lottery LLM and KV cache compression methods must preserve, capabilities that current compression methods largely overlook. This perspective motivates capability-aware evaluation criteria and an application-driven agenda for efficient LLM compression.
📝 Abstract
Motivated by reducing the computational and storage costs of LLMs, model compression and KV cache compression have attracted much attention from researchers. However, current methods predominantly emphasize maintaining the performance of compressed LLMs as measured by perplexity or simple accuracy on tasks such as commonsense question answering and basic arithmetic reasoning. In this blog, we present a brief review of recent advancements in LLMs related to retrieval-augmented generation, multi-step reasoning, external tools, and computational expressivity, all of which substantially enhance LLM performance. We then propose the lottery LLM hypothesis: for a given LLM and task, there exists a smaller lottery LLM capable of achieving the same performance as the original LLM with the assistance of multi-step reasoning and external tools. Based on this review of current progress in LLMs, we discuss and summarize the essential capabilities that the lottery LLM and KV cache compression methods must possess, capabilities that are currently overlooked in existing methods.