Regulating Branch Parallelism in LLM Serving

📅 2026-05-07

🤖 AI Summary
Existing LLM serving systems struggle to balance throughput and latency when requests expose intra-request parallelism: eager branch admission inflates the shared decode step and degrades co-batched requests, while conservative fixed caps forgo the throughput gains that branches were meant to deliver. This work proposes TAPER, which introduces the concept of "branch externality" (the excess step latency caused by admitted branches) and a per-step admission controller that treats additional branches as opportunistic work, admitting them only when the predicted externality fits within the current batch's slack budget. Because branch-level scheduling decouples compute from memory, branches share the request's prefix KV cache, and width can expand or contract without memory reclamation. The safe width depends on batch composition, context lengths, and accumulated slack, all of which change over a workload trace, so admission is re-evaluated at every step. Evaluated on Qwen3-32B, TAPER improves goodput by 1.77× over IRP-Off and by 1.48× over IRP-Eager while maintaining over 95% SLO attainment.
📝 Abstract
Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by $1.77\times$ over IRP-Off and by $1.48\times$ over IRP-Eager, while maintaining over $95\%$ SLO attainment.
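The admission rule described in the abstract can be illustrated with a minimal sketch. It is not TAPER's implementation: the `BatchState` fields, the function names, and the linear per-branch cost model are all assumptions made for illustration, and the abstract does not specify how externality is actually predicted. The sketch only shows the general shape of "admit extra branches while the predicted externality fits within the slack budget."

```python
from dataclasses import dataclass


@dataclass
class BatchState:
    """Hypothetical snapshot of the running batch at one decode step."""
    per_branch_ms: float  # assumed marginal step latency per extra branch
    slack_ms: float       # slack budget the batch can absorb this step


def predicted_externality(state: BatchState, width: int) -> float:
    """Excess step latency from admitting `width` extra branches.

    The linear cost model is an illustrative assumption, not the paper's predictor.
    """
    return state.per_branch_ms * width


def admit_width(state: BatchState, candidates: int) -> int:
    """Largest number of extra branches whose predicted externality
    stays within the current slack budget (re-evaluated every step)."""
    width = 0
    while (width < candidates
           and predicted_externality(state, width + 1) <= state.slack_ms):
        width += 1
    return width
```

Because the decision is made per step, a controller of this kind can be called again with a fresh `BatchState` whenever batch composition or accumulated slack changes; with zero slack it admits no extra branches.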
Problem

Research questions and friction points this paper is trying to address.

Branch Parallelism
LLM Serving
Decode Latency
Batch Scheduling
Throughput Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branch Parallelism
LLM Serving
Admission Control
Slack Budget
KV Cache Sharing