AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high cost of cross-framework inference optimization for large language models (LLMs) in production, where dynamic workloads, stringent latency and throughput requirements, and vast configuration spaces pose significant challenges. The authors propose a unified performance modeling system that decomposes LLM inference into fundamental operations—such as GEMM, attention, communication, and memory—and integrates a kernel-level performance database with an abstract scheduling model to enable rapid, GPU-measurement-free configuration search across frameworks. For the first time, the system supports automatic optimization across mainstream serving frameworks including TRT-LLM, vLLM, and SGLang, covering full-stack configurations from cluster topology to engine parameters. Experiments show the system completes optimization in under 30 seconds on average, achieving up to 40% performance gains on dense models like Qwen3-32B and up to 50% on MoE models such as DeepSeek-V3.
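The decomposition idea described above can be sketched in a few lines: per-operation latencies are looked up in a calibrated kernel database and summed, so no GPU measurement is needed at search time. This is an illustrative toy, not the paper's actual API; the database entries, shapes, and numbers below are all hypothetical.

```python
# Illustrative sketch: predict a transformer layer's latency by summing
# primitive-operation costs from a calibrated kernel database.
# All keys and latency values are made up for illustration.

# Toy "kernel database": measured latencies (ms) keyed by (op, shape).
KERNEL_DB = {
    ("gemm", (4096, 4096, 4096)): 0.42,   # M, N, K
    ("attention", (32, 2048, 128)): 0.31,  # heads, seq_len, head_dim
    ("allreduce", 64 * 2**20): 0.18,       # message bytes
}

def layer_latency_ms(hidden: int, seq: int, heads: int, tp: int) -> float:
    """Predict one layer's latency as a sum of primitive costs."""
    head_dim = hidden // heads
    gemm = KERNEL_DB[("gemm", (hidden, hidden, hidden))] / tp  # sharded GEMM
    attn = KERNEL_DB[("attention", (heads // tp, seq, head_dim))]
    comm = KERNEL_DB[("allreduce", 64 * 2**20)] if tp > 1 else 0.0
    return gemm + attn + comm

print(layer_latency_ms(hidden=4096, seq=2048, heads=32, tp=1))
```

In the real system the database covers many shapes, hardware platforms, and framework-specific kernels; the sketch only shows how summing calibrated primitive costs replaces live profiling.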

📝 Abstract
Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only distributed parallelism strategies (tensor/pipeline/expert) but also intricate framework-specific runtime parameters, such as CUDA graph enablement, available KV-cache memory fractions, and maximum token capacity, which drastically impact performance. The diversity of modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang), each employing distinct kernels and execution policies, makes manual tuning both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables rapid, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend, seamlessly integrating into production-grade orchestration systems. Evaluation on production LLM serving workloads demonstrates that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average, enabling rapid exploration of vast design spaces, from cluster topology down to engine-specific flags.
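Because the cost model is purely analytical, the full configuration space (parallelism degrees, token capacity, CUDA graph flags) can be enumerated in milliseconds. The sketch below shows that search pattern; the throughput formula and its coefficients are invented for illustration, where the real system would instead consult its measured kernel database.

```python
from itertools import product

def predicted_throughput(tp: int, pp: int, max_tokens: int,
                         cuda_graphs: bool) -> float:
    """Hypothetical analytical tokens/sec model (coefficients are made up)."""
    compute = 1000.0 * tp * pp / (1 + 0.15 * (tp - 1))  # scaling minus comm penalty
    batching = min(1.0, max_tokens / 8192)              # token-capacity utilization
    graph_bonus = 1.1 if cuda_graphs else 1.0           # CUDA graph launch savings
    return compute * batching * graph_bonus

def search(gpus: int):
    """Exhaustively score every feasible configuration; no GPU runs needed."""
    best = None
    for tp, pp, mt, cg in product([1, 2, 4, 8], [1, 2],
                                  [2048, 8192], [False, True]):
        if tp * pp > gpus:  # config must fit the cluster
            continue
        score = predicted_throughput(tp, pp, mt, cg)
        if best is None or score > best[0]:
            best = (score, dict(tp=tp, pp=pp, max_tokens=mt, cuda_graphs=cg))
    return best

score, cfg = search(gpus=8)
print(cfg)
```

Under these toy coefficients the search trades tensor-parallel communication overhead against pipeline parallelism and picks the balanced split, which is the kind of non-obvious full-stack decision the paper reports finding automatically.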
Problem

Research questions and friction points this paper is trying to address.

LLM inference optimization
configuration space
multi-framework serving
latency constraints
runtime parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

framework-agnostic optimization
performance modeling
LLM inference
configuration search
kernel-level profiling
Tianhao Xu
NVIDIA
Yiming Liu
NVIDIA
Xianglong Lu
NVIDIA
Data Center GPUs · LLM inference · Camera/LiDAR perception · SLAM · Robotics
Yijia Zhao
NVIDIA
Xuting Zhou
NVIDIA
Aichen Feng
NVIDIA
Yiyi Chen
PhD Candidate, Aalborg University
Machine Learning · Deep Learning · Natural Language Processing
Yi Shen
NVIDIA
Qin Zhou
East China University of Science and Technology
computer vision · medical image analysis · federated learning · multi-modal learning
Xumeng Chen
NVIDIA
Ilya Sherstyuk
NVIDIA
Haorui Li
Shanghai Jiaotong University
Communication theory
Rishi Thakkar
NVIDIA
Ben Hamm
NVIDIA
Yuanzhe Li
NVIDIA
Xue Huang
NVIDIA
Wenpeng Wu
NVIDIA
Anish Shanbhag
NVIDIA
Harry Kim
NVIDIA
Chuan Chen
University of Wisconsin, Madison
Applied Microeconomics
Junjie Lai
NVIDIA