Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the joint optimization of model selection, resource allocation, and parallelism configuration for mixed-scale large language model inference on heterogeneous GPU clusters under multidimensional constraints on latency, accuracy, and budget. The authors propose two constraint-aware heuristic algorithms: a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH); the latter integrates multi-start construction, relocation-based local search, and GPU consolidation strategies. Key innovations include tensor-parallelism (TP)-aware feasibility filtering, cost-per-effective-coverage ranking, and a TP-upgrading mechanism, which together improve solution quality and computational efficiency while strictly satisfying constraints. Experiments on real-world Azure traces show that AGH produces near-optimal feasible solutions within one second, over 260× faster than mixed-integer linear programming, and maintains controlled SLO violations and stable cost even under 1.5× parameter inflation.
📝 Abstract
Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled SLO violations and stable cost, whereas the exact solver's placement degrades sharply.
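The core of the GH construction phase described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Config` fields, the simplified feasibility check, and the replica-based coverage model are all assumptions introduced here. The paper's AGH additionally layers multi-start construction, relocation-based local search, GPU consolidation, and TP upgrading on top of this single-pass loop.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """One (model, GPU type, TP degree) deployment option (hypothetical fields)."""
    model: str
    gpu: str
    tp: int            # tensor-parallel degree
    cost: float        # cost per replica (e.g. $/hour for tp GPUs)
    capacity: float    # requests/s one replica serves within the latency SLO
    mem_per_gpu: float # per-GPU memory footprint after splitting weights over tp GPUs
    accuracy: float    # model quality score

def feasible(c: Config, gpu_mem: dict, min_acc: float) -> bool:
    # TP-aware feasibility filtering: the per-GPU shard must fit in device
    # memory, and the model must meet the accuracy constraint.
    return c.mem_per_gpu <= gpu_mem[c.gpu] and c.accuracy >= min_acc

def greedy_allocate(configs, demand, gpu_mem, min_acc, budget):
    """Single-pass greedy: rank feasible configs by cost per unit of
    effective coverage, then add replicas until demand or budget is hit."""
    ranked = sorted(
        (c for c in configs if feasible(c, gpu_mem, min_acc)),
        key=lambda c: c.cost / c.capacity,  # cost-per-effective-coverage ranking
    )
    plan, covered, spent = [], 0.0, 0.0
    for c in ranked:
        while covered < demand and spent + c.cost <= budget:
            plan.append(c)
            covered += c.capacity
            spent += c.cost
    return plan, covered >= demand, spent

# Toy usage with two deployment options:
gpu_mem = {"A100": 80.0, "H100": 80.0}
configs = [
    Config("7B",  "A100", tp=1, cost=2.0,  capacity=10.0, mem_per_gpu=14.0, accuracy=0.80),
    Config("70B", "H100", tp=4, cost=12.0, capacity=15.0, mem_per_gpu=35.0, accuracy=0.90),
]
plan, ok, spent = greedy_allocate(configs, demand=25.0,
                                  gpu_mem=gpu_mem, min_acc=0.75, budget=100.0)
```

Here the 7B option wins the ranking (0.2 vs 0.8 cost per unit coverage), so the greedy pass stacks cheap replicas first; the real heuristics refine such a plan with local search and consolidation rather than stopping here.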
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Heterogeneous Serving
SLO-Constrained Inference
Resource Allocation
Scalable Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous serving
LLM inference
constraint-aware heuristics
tensor parallelism
SLO-constrained optimization