🤖 AI Summary
Existing approaches to accelerating long-context prefilling predominantly rely on sparse attention mechanisms, which struggle to adapt to hybrid attention architectures (such as linear/full or sliding-window/full attention) and lack support for continuous batching, limiting their applicability in modern inference engines. To address these limitations, this work proposes UniPrefill, a general-purpose prefilling acceleration framework that directly accelerates virtually any attention architecture at the token level through block-level dynamic sparsification. UniPrefill introduces efficient operators that integrate seamlessly into vLLM and support continuous batching, coordinated prefill-decode scheduling, and tensor parallelism. Experimental results show that UniPrefill achieves up to a 2.1× speedup in Time-To-First-Token (TTFT), with larger gains under higher concurrency.
📝 Abstract
As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures, such as linear/full attention hybrids or sliding-window/full attention hybrids, these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallelism for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to a 2.1× speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.