Draft-based Approximate Inference for LLMs

📅 2025-06-10

🤖 AI Summary

Problem: Transformer-based large language models (LLMs) incur quadratic compute and linear memory costs in context length, which makes long-context inference expensive. Method: The paper proposes what the authors describe as the first approximate inference framework built on a lightweight draft model, departing from the conventional lossless speculative decoding paradigm. It introduces two mechanisms: SpecKV, which uses a draft output to estimate the importance of each key-value (KV) pair and drops low-importance pairs from the KV cache, and SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. Both rest on the observation that the attention patterns of draft and target models are strongly correlated. Contribution/Results: Experiments on long-context benchmarks show that the methods consistently achieve higher accuracy than existing approximate inference baselines while preserving the same improvements in memory usage, latency, and throughput.

📝 Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
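
The SpecKV idea described in the abstract can be illustrated with a minimal sketch. It assumes that a query vector is available for each draft output token and a key vector for each prompt token; the function names, the max-over-queries aggregation, and the fixed `budget` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kv_importance(draft_queries, prompt_keys):
    """Score each prompt KV pair by the largest attention weight it
    receives from any draft-output query (aggregation is an assumption)."""
    scale = 1.0 / math.sqrt(len(prompt_keys[0]))
    scores = [0.0] * len(prompt_keys)
    for q in draft_queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in prompt_keys]
        weights = softmax(logits)
        scores = [max(s, w) for s, w in zip(scores, weights)]
    return scores


def select_kv(scores, budget):
    """Return the indices of the `budget` highest-scoring KV pairs,
    in their original positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: four prompt keys, two draft-output queries (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]
kept = select_kv(kv_importance(queries, keys), budget=2)
print(kept)
```

In this toy case each query aligns with one key, so the two aligned KV pairs are kept and the other two would be dropped from the cache.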

Problem

Research questions and friction points this paper is trying to address.

Optimizing long-context LLM inference, where Transformer compute grows quadratically and memory linearly with context length

Accurately predicting token and KV pair importance using draft models

Improving KV cache dropping and prompt compression via draft-based methods
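
To make the quadratic-compute and linear-memory friction concrete, the following sketch uses the standard back-of-the-envelope formulas for KV cache size and causal attention score count. The model configuration (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical example, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: one key and one value vector
    per token, per layer, per KV head (hence the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_count(seq_len):
    """Query-key scores per head per layer under causal attention:
    token i attends to itself and its i predecessors."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical configuration, chosen only for illustration.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (1_000, 2_000, 4_000):
    mem_mb = kv_cache_bytes(n, **config) / 1e6
    print(f"{n} tokens: {mem_mb:.1f} MB cache, "
          f"{attention_score_count(n)} scores per head per layer")
```

Doubling the context doubles the cache size but roughly quadruples the number of attention scores, which is why both KV cache dropping and prompt compression target the prompt length.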

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses draft models for importance prediction

Introduces SpecKV for KV cache optimization

Presents SpecPC for prompt token reduction
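
The prompt token reduction idea behind SpecPC can be sketched as follows, assuming the draft model's attention weights are available as a nested list indexed by head, query position, and prompt token. The mean aggregation, the fixed `keep` count, and all names are assumptions made for illustration rather than the paper's actual procedure.

```python
def token_importance(draft_attention):
    """Average the attention each prompt token receives across all heads
    and query positions. draft_attention[h][q][t] is the weight from
    query position q to prompt token t in head h."""
    n_tokens = len(draft_attention[0][0])
    totals = [0.0] * n_tokens
    rows = 0
    for head in draft_attention:
        for row in head:
            for t, w in enumerate(row):
                totals[t] += w
            rows += 1
    return [s / rows for s in totals]


def compress_prompt(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving prompt order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]


# Toy prompt and a single-head attention map (hypothetical values).
tokens = ["The", "capital", "of", "France", "is", "?"]
attention = [[
    [0.05, 0.40, 0.05, 0.40, 0.05, 0.05],
    [0.10, 0.30, 0.05, 0.45, 0.05, 0.05],
]]
compressed = compress_prompt(tokens, token_importance(attention), keep=3)
print(compressed)
```

The compressed prompt would then be passed to the target model in place of the full prompt; preserving the original token order keeps the shortened text readable to the model.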
