LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure

📅 2025-11-10
🏛️ IEEE Computer Architecture Letters
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM serving simulators suffer from two key limitations: insufficient hardware abstraction and narrow coverage of serving techniques. To address these, the authors propose LLMServingSim2.0, a unified simulation platform built on a three-layer modular architecture (a hardware abstraction layer, a serving policy layer, and a trace-driven modeling layer) that enables co-modeling of heterogeneous accelerators (e.g., GPUs, TPUs) and diverse serving techniques (e.g., request routing, caching, scheduling). LLMServingSim2.0 introduces operator-level latency profiling and a plug-and-play accelerator interface, enabling integration of new hardware with a single command. Evaluation shows that a TPU case study requires 18.5× fewer lines of code than the predecessor's hardware-simulator integration, and that the simulator achieves only 1.9% end-to-end latency error on GPU-based serving. It outperforms prior simulators in both simulation efficiency and fidelity, establishing a high-fidelity, scalable evaluation infrastructure for LLM system design.

📝 Abstract
This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer LoC and outperforms the predecessor's hardware-simulator integration, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
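The abstract's core mechanism, trace-driven performance modeling backed by an operator-level latency profiler, can be illustrated with a small sketch. The class and method names below are assumptions for illustration, not LLMServingSim2.0's actual API: profiled per-operator latencies are stored in a lookup table, and a layer's latency is estimated by replaying its operator trace against that table.

```python
# Hypothetical sketch of trace-driven, operator-level latency modeling in the
# spirit of LLMServingSim2.0. All names and the fallback estimate below are
# illustrative assumptions, not the simulator's real interface.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpKey:
    """Identifies a profiled operator instance by name and shape."""
    name: str   # e.g. "matmul", "softmax"
    m: int
    n: int

class LatencyTable:
    """Maps profiled operators to measured latencies (microseconds)."""
    def __init__(self):
        self._table = {}

    def record(self, key: OpKey, latency_us: float) -> None:
        self._table[key] = latency_us

    def lookup(self, key: OpKey) -> float:
        # Fall back to a crude work-proportional estimate when the exact
        # shape was never profiled (purely illustrative).
        if key in self._table:
            return self._table[key]
        return 0.001 * key.m * key.n

def simulate_layer(table: LatencyTable, ops: list[OpKey]) -> float:
    """Estimate a layer's latency by summing along its operator trace."""
    return sum(table.lookup(op) for op in ops)

table = LatencyTable()
table.record(OpKey("matmul", 4096, 4096), 850.0)
trace = [OpKey("matmul", 4096, 4096), OpKey("softmax", 4096, 1)]
print(simulate_layer(table, trace))  # ~854.1: one profiled op + one estimated op
```

Under this scheme, plugging in a new accelerator only requires producing a new latency table from the profiler, which is consistent with the paper's claim of single-command hardware integration.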
Problem

Research questions and friction points this paper is trying to address.

Simulating heterogeneous hardware integration in LLM serving systems
Supporting diverse serving techniques beyond narrow existing approaches
Providing flexible infrastructure for modern LLM serving configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trace-driven performance modeling for hardware integration
Operator-level latency profiler for accelerator compatibility
Flexible interfaces for serving techniques and policies
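The "flexible interfaces" point above can be sketched as a pluggable policy abstraction: the simulator defines an extension point, and different routing (or scheduling) strategies drop in behind it. The interface and policy names here are hypothetical, not LLMServingSim2.0's actual classes.

```python
# Illustrative sketch of a pluggable request-routing interface, loosely
# modeled on the policy extension points described above. All names are
# assumptions, not the simulator's real API.
from abc import ABC, abstractmethod

class RoutingPolicy(ABC):
    @abstractmethod
    def route(self, request, replicas) -> int:
        """Pick the index of the replica that should serve the request."""

class RoundRobin(RoutingPolicy):
    """Cycle through replicas regardless of load."""
    def __init__(self):
        self._next = 0

    def route(self, request, replicas) -> int:
        idx = self._next % len(replicas)
        self._next += 1
        return idx

class LeastLoaded(RoutingPolicy):
    """Send the request to the replica with the shallowest queue."""
    def route(self, request, replicas) -> int:
        # Here `replicas` is a list of current queue depths.
        return min(range(len(replicas)), key=lambda i: replicas[i])

rr = RoundRobin()
print([rr.route("req", [0, 0, 0]) for _ in range(4)])  # [0, 1, 2, 0]
print(LeastLoaded().route("req", [3, 1, 2]))           # 1
```

Swapping policies then means passing a different `RoutingPolicy` instance to the simulator, which is the kind of low-effort extensibility the paper claims for routing, cache-management, and scheduling policies.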
Jaehong Cho
School of Computing, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Hyunmin Choi
School of Computing, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Jongse Park
Associate Professor; School of Computing; KAIST
Computer Architecture, HW/SW Codesign, AI Systems, Autonomous Systems