Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT

📅 2026-03-29

🤖 AI Summary

This work addresses the tension between high retrieval latency and the need for low time-to-first-token (TTFT) while preserving context completeness in large language model inference. The authors propose a streaming context processing system tailored for concurrent requests, which overlaps context retrieval with model inference to significantly reduce TTFT without compromising generation quality. The system introduces an adaptive scheduling and preemption mechanism compatible with both append-mode and update-mode operations, decoupling scheduling decisions from resource acquisition. It further leverages longest common prefix matching to minimize redundant computation under dynamic inputs. Built upon a decoupled prefill-decode architecture, a hardware-aware cost model, and efficient strategies for streaming context loading and cache invalidation, the approach achieves up to 11× lower TTFT under realistic large-scale streaming workloads while maintaining throughput comparable to non-streaming baselines even under memory constraints.

📝 Abstract

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming (overlapping retrieval with inference), but prior systems focus on single-request settings and overlook challenges in multi-tenant deployments, where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals. We present STREAM2LLM, a system that extends vLLM to support streaming prompts with adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). STREAM2LLM decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses cache invalidation based on longest common prefix matching to minimize redundant computation when prompts change dynamically. To evaluate STREAM2LLM, we collect and characterize two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that the streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, while maintaining throughput parity with non-streaming baselines.
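
The longest-common-prefix idea in the abstract can be illustrated with a minimal Python sketch. This is not the STREAM2LLM implementation: the function names, the token-list representation, and the rounding of the reusable prefix down to a block boundary (with a hypothetical `block_size` of 16) are all assumptions made for illustration. The sketch only shows how a changed prompt could be split into a reusable cached prefix, an invalidated tail, and a remainder that still needs prefill.

```python
from typing import List, NamedTuple


class UpdatePlan(NamedTuple):
    reusable: int      # cached tokens that can be kept
    invalidated: int   # cached tokens that must be dropped
    to_prefill: int    # tokens of the new prompt still to be computed


def longest_common_prefix_len(old: List[int], new: List[int]) -> int:
    """Count the leading token IDs shared by the two prompts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


def plan_update(cached: List[int], new: List[int], block_size: int = 16) -> UpdatePlan:
    """Hypothetical planner: keep whole cache blocks covered by the common prefix.

    Assumption: the cache is managed in fixed-size blocks, so only complete
    blocks inside the common prefix are treated as reusable.
    """
    lcp = longest_common_prefix_len(cached, new)
    reusable = (lcp // block_size) * block_size
    return UpdatePlan(
        reusable=reusable,
        invalidated=len(cached) - reusable,
        to_prefill=len(new) - reusable,
    )


# Update-mode: the prompt diverges from the cached one at position 35.
cached_prompt = list(range(40))
updated_prompt = list(range(35)) + [999, 1000, 1001]
update_plan = plan_update(cached_prompt, updated_prompt)

# Append-mode: the new prompt extends the cached one, so nothing is invalidated.
append_plan = plan_update(list(range(32)), list(range(48)))
```

In this sketch, append-mode falls out as the special case where the cached prompt is itself a prefix of the new one, so only the newly arrived tokens need prefill.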

Problem

Research questions and friction points this paper is trying to address.

context retrieval

time-to-first-token

streaming inference

concurrent requests

LLM serving

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming context

adaptive scheduling

prefill-decode disaggregation