QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high latency and energy consumption of autoregressive inference in large language models (LLMs), this paper proposes QuickSilver, a modular, training-free inference acceleration framework that preserves the original model architecture. The method introduces four semantics-aware, dynamic optimization techniques: Dynamic Token Halting, KV Cache Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization, which together reduce computation across all inference stages. Fully compatible with standard decoding protocols, it requires no auxiliary networks or architectural modifications. Experiments on GPT-2 and Llama-2 demonstrate up to a 39.6% FLOP reduction while keeping the perplexity increase ≤0.2, achieving substantial inference speedup and energy efficiency with negligible accuracy degradation.

📝 Abstract
Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining or architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length; and (iv) Adaptive Matryoshka Quantization, which adapts numerical precision during decoding. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (≤0.2).
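The abstract does not specify the halting criterion for Dynamic Token Halting. A minimal sketch of one plausible convergence test, flagging tokens whose hidden states barely change between consecutive layers, is shown below; the cosine-similarity rule and the 0.99 threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np

def should_halt(prev_hidden, curr_hidden, threshold=0.99):
    """Flag tokens whose representations have converged across layers.

    prev_hidden, curr_hidden: (seq_len, d_model) hidden states from two
    consecutive transformer layers. The cosine-similarity criterion and
    the threshold value are assumptions for illustration only.
    """
    num = np.sum(prev_hidden * curr_hidden, axis=-1)
    denom = (np.linalg.norm(prev_hidden, axis=-1)
             * np.linalg.norm(curr_hidden, axis=-1) + 1e-8)
    cos_sim = num / denom
    return cos_sim >= threshold  # True -> freeze this token's computation

# Toy example: token 0 barely changed between layers, token 1 changed a lot.
prev = np.array([[1.0, 0.0], [1.0, 0.0]])
curr = np.array([[1.0, 0.01], [0.0, 1.0]])
halted = should_halt(prev, curr)
```

In a real decoder loop, halted tokens would simply carry their last representation forward through the remaining layers instead of being recomputed.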
Problem

Research questions and friction points this paper is trying to address.

Reduces LLM inference latency and energy consumption
Optimizes runtime without altering model weights
Improves efficiency via token-level dynamic mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Token Halting for converged tokens
KV Cache Skipping to reduce attention overhead
Contextual Token Fusion for redundant tokens
Adaptive Matryoshka Quantization for runtime precision adaptation
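Of the mechanisms listed above, KV Cache Skipping lends itself to a compact sketch: suppress key/value cache writes for tokens judged unimportant, so later attention steps never attend to them. The per-token importance scores and the threshold below are hypothetical; the paper's actual scoring rule is not given in this summary.

```python
import numpy as np

def append_kv(cache_k, cache_v, new_k, new_v, importance, tau=0.1):
    """Append key/value vectors to the cache only for 'important' tokens.

    cache_k, cache_v: existing caches of shape (cached_len, d_head).
    new_k, new_v: candidate entries of shape (n_new, d_head).
    importance: per-token scores in [0, 1]; tau is an assumed threshold.
    """
    keep = importance >= tau  # boolean mask: which writes to allow
    cache_k = np.concatenate([cache_k, new_k[keep]], axis=0)
    cache_v = np.concatenate([cache_v, new_v[keep]], axis=0)
    return cache_k, cache_v, keep

# Toy example: three new tokens, the middle one scored as unimportant.
k0, v0 = np.zeros((0, 4)), np.zeros((0, 4))
new_k, new_v = np.ones((3, 4)), np.ones((3, 4))
scores = np.array([0.5, 0.05, 0.2])
ck, cv, kept = append_kv(k0, v0, new_k, new_v, scores)
```

Skipped entries shrink both the memory footprint of the cache and the cost of every subsequent attention step, since attention is computed only over the retained keys and values.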