OverFill: Two-Stage Models for Efficient Language Model Decoding

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high decoding latency in large language model (LLM) inference caused by memory bottlenecks, this work proposes a two-stage decoupled architecture: the prefill stage uses a full-size model for high-fidelity context understanding, and the decode stage then switches to a dense pruned model to generate tokens sequentially at lower cost. The method combines structured pruning with a stage-adaptive model switch, spending extra compute only during the parallel prefill phase. On standard benchmarks, the 3B→1B and 8B→3B configurations outperform same-sized pruned baselines (1B and 3B) by 83.2% and 79.2% on average, and match the performance of comparably sized models trained from scratch while requiring significantly less training data. The core contribution is decoupling model size between the prefill and decode stages, easing the decode latency bottleneck without compromising output quality.

📝 Abstract
Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.
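The two-stage handoff the abstract describes can be sketched with toy stand-ins: a large "prefill" model processes the full prompt in parallel and produces the cache that conditions generation, after which a smaller pruned "decode" model takes over token-by-token. This is a minimal illustration of the control flow only; the class and function names (`ToyPrefillModel`, `ToyDecodeModel`, `overfill_generate`) are hypothetical and do not come from the OverFill codebase.

```python
class ToyPrefillModel:
    """Stand-in for the large prefill model: encodes the whole prompt at once
    into a cache (real systems build a transformer KV cache here)."""
    def __call__(self, prompt_ids):
        return list(prompt_ids)  # toy "KV cache": just the tokens seen so far


class ToyDecodeModel:
    """Stand-in for the dense pruned decode model: predicts one token per step
    from the cache (here, trivially, the sum of cached tokens mod 10)."""
    def step(self, cache):
        token = sum(cache) % 10
        cache.append(token)      # extend the cache with the new token
        return token, cache


def overfill_generate(prompt_ids, prefill_model, decode_model, max_new_tokens):
    # Stage 1: parallel prefill with the large model (compute-bound).
    cache = prefill_model(prompt_ids)
    # Stage 2: sequential decode with the smaller model (memory-bound),
    # reusing the cache produced by the large model.
    out = []
    for _ in range(max_new_tokens):
        token, cache = decode_model.step(cache)
        out.append(token)
    return out


print(overfill_generate([1, 2, 3], ToyPrefillModel(), ToyDecodeModel(), 3))
```

The design point the sketch captures: the expensive model runs exactly once over the prompt, so its cost is amortized across the whole sequence, while every sequential decode step pays only the small model's latency.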
Problem

Research questions and friction points this paper is trying to address.

High LLM inference cost and latency, dominated by the memory-bound decode stage, especially for long sequences
Decoder-only models handle the compute-bound prefill and memory-bound decode stages uniformly despite their distinct computational profiles
How to reduce decode latency without sacrificing generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples the prefill and decode stages: a full model processes system and user inputs in parallel, then a dense pruned model generates tokens sequentially
Leverages extra prefill compute to condition the smaller decoder, improving generation quality with minimal latency overhead
Matches same-sized models trained from scratch while using significantly less training data