🤖 AI Summary
This work addresses the high energy consumption of large language model (LLM) inference serving, which persists even after disaggregating the prefill and decode stages: existing autoscaling and dynamic voltage and frequency scaling (DVFS) techniques struggle with sudden load shifts and inter-stage dynamics. To this end, we propose BiScale, the first cross-time-scale, phase-aware joint optimization framework tailored for disaggregated LLM serving. At coarse granularity, BiScale performs phase-aware placement and base-frequency configuration to meet service-level objectives (SLOs) while reducing energy use; at fine granularity, it applies model predictive control to the prefill stage and lightweight slack-aware frequency scaling to the decode stage. Evaluated on a 16-node H100 cluster running Llama-3.3-70B, BiScale reduces energy consumption by up to 39% and 48% in the prefill and decode stages, respectively, compared to DistServe, while strictly satisfying time-to-first-token (TTFT) and time-per-output-token (TPOT) SLOs.
📄 Abstract
Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control.
We present BiScale, a two-tier energy optimization framework for disaggregated LLM serving. BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, BiScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, BiScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs.
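To make the decode-side control concrete, the sketch below shows one way lightweight slack-aware frequency adaptation could work: compare the observed per-token latency against the TPOT budget and step the GPU frequency down when slack exists, up when it is exhausted. The function name, frequency bounds, safety headroom, and fixed step size are illustrative assumptions, not BiScale's actual implementation.

```python
def next_decode_frequency(tpot_slo_ms, observed_tpot_ms, freq_mhz,
                          freq_min_mhz=800, freq_max_mhz=1980,
                          headroom=0.9, step_mhz=60):
    """Pick the next decode-stage GPU frequency from current TPOT slack.

    Hypothetical policy: if observed per-token latency is comfortably
    under the SLO budget, lower the frequency to save energy; if it
    approaches or exceeds the budget, raise the frequency back up.
    """
    budget_ms = tpot_slo_ms * headroom  # keep a safety margin below the SLO
    if observed_tpot_ms < budget_ms:
        freq_mhz -= step_mhz  # slack available: slow down, save energy
    else:
        freq_mhz += step_mhz  # slack gone: speed back up to protect TPOT
    # Clamp to the hardware's supported frequency range
    return max(freq_min_mhz, min(freq_max_mhz, freq_mhz))
```

Because decode is memory-bound and its latency evolves smoothly, such a per-iteration reactive rule can stay near the SLO boundary without the lookahead that prefill's bursty queue dynamics require from MPC.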
Evaluation on a 16-node H100 cluster serving Llama-3.3-70B with production-style traces shows that BiScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.