Dynamic Rebatching for Efficient Early-Exit Inference with DREX

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional batched inference frameworks struggle to support early-exit (EE) large language models because requests within a batch exit asynchronously at different layers; existing approaches either forfeit EE opportunities via uniform exit decisions or degrade output quality via forced premature exits. This paper proposes a dynamic re-batching mechanism: at each layer's exit point, requests are partitioned in real time into those satisfying exit conditions and those requiring further computation, then independently scheduled and re-batched. Key contributions include: (1) a copy-free re-batching buffer that eliminates redundant KV-cache duplication; and (2) an EE- and SLA-aware reward-predictive scheduler enabling fine-grained, quality-controllable dynamic batch management. Experiments demonstrate that the method improves throughput by 2-12% while strictly preserving generation quality, completely eliminating involuntary exits, and, for the first time, aligning EE design intent with system-level scheduling.

📝 Abstract
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
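The core idea in the abstract, partitioning a batch at each exit point into requests that exit now and requests that continue to deeper layers, can be sketched in a few lines. This is an illustrative toy, not DREX's actual implementation: the `Request` class, the confidence field, and the threshold are all assumed names for exposition.

```python
# Minimal sketch of dynamic rebatching at a layer's exit point.
# All names (Request, exit_conf, CONF_THRESHOLD) are illustrative
# assumptions, not part of DREX's API.
from dataclasses import dataclass

CONF_THRESHOLD = 0.9  # hypothetical early-exit confidence threshold


@dataclass
class Request:
    rid: int
    exit_conf: float  # exit-head confidence at the current layer
    done: bool = False


def rebatch_at_exit_point(batch):
    """Partition a batch at an exit point: requests whose exit condition
    holds are finalized immediately; the rest are regrouped into a new,
    smaller batch and forwarded to the deeper layers."""
    exiting = [r for r in batch if r.exit_conf >= CONF_THRESHOLD]
    continuing = [r for r in batch if r.exit_conf < CONF_THRESHOLD]
    for r in exiting:
        r.done = True  # token emitted using only the layers run so far
    return exiting, continuing


batch = [Request(0, 0.95), Request(1, 0.40), Request(2, 0.91)]
exiting, continuing = rebatch_at_exit_point(batch)
print([r.rid for r in exiting], [r.rid for r in continuing])  # [0, 2] [1]
```

In the real system the interesting part is doing this split without copying KV-cache state, which is what the copy-free rebatching buffer below addresses.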
Problem

Research questions and friction points this paper is trying to address.

Optimizes early-exit LLM inference by dynamically reorganizing batches
Eliminates forced premature exits to preserve model output quality
Improves throughput via copy-free rebatching and predictive scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic batch reorganization at early-exit points
Copy-free rebatching buffer avoiding data movement
SLA-aware scheduler predicting profitable rebatching operations
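The scheduler's job, per the abstract, is to predict analytically whether a given rebatching operation will be profitable before performing it. A minimal cost-model sketch of that decision is below; the formula, parameter names, and constants are assumptions for exposition, not the paper's actual model.

```python
# Illustrative profitability check in the spirit of an EE- and SLA-aware
# scheduler. The cost model here is an assumption, not DREX's.
def rebatch_is_profitable(n_exiting, layers_left, per_layer_cost_ms,
                          rebatch_overhead_ms, sla_slack_ms):
    """Predict whether splitting the batch now pays off.

    Splitting spares the exiting requests the cost of the remaining
    layers, at the price of a fixed rebatching overhead; it is only
    worthwhile if the saving exceeds that overhead and the overhead
    still fits within the tightest request's SLA slack."""
    saved_ms = n_exiting * layers_left * per_layer_cost_ms
    return saved_ms > rebatch_overhead_ms and rebatch_overhead_ms <= sla_slack_ms


# Splitting off 8 requests with 10 layers left clearly pays off:
print(rebatch_is_profitable(8, 10, 0.5, 3.0, 20.0))  # True
# Splitting off 1 request with 1 layer left does not cover the overhead:
print(rebatch_is_profitable(1, 1, 0.5, 3.0, 20.0))   # False
```

The point of such a predictive check is that rebatching is skipped when the split would cost more than it saves, so the system never trades SLA compliance for marginal EE gains.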
Xuting Liu
University of Pennsylvania

Daniel Alexander
University of Pennsylvania

Siva Kesava Reddy Kakarla
Microsoft Research

Behnaz Arzani
Principal Researcher, Microsoft Research

Vincent Liu
University of Pennsylvania