🤖 AI Summary
Existing LLM serving simulators suffer from two key limitations: the lack of a clear hardware abstraction, which makes integrating new hardware models non-trivial, and coverage of only a narrow subset of modern serving techniques. LLMServingSim2.0 addresses both by adopting trace-driven performance modeling with an operator-level latency profiler, enabling heterogeneous accelerators (e.g., GPUs, TPUs) to be integrated with a single command, and by embedding up-to-date serving techniques behind flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, the profiler requires 18.5× fewer lines of code than the predecessor's hardware-simulator integration while outperforming it, and the simulator reproduces GPU-based LLM serving with only 1.9% end-to-end latency error at practical simulation time, establishing a low-effort, high-fidelity evaluation platform for both hardware developers and LLM service providers.
📝 Abstract
This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer LoC than the predecessor's hardware-simulator integration while outperforming it, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
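To make the trace-driven approach concrete, the sketch below shows one plausible shape for operator-level latency modeling: an offline profiler fills a per-accelerator lookup table keyed by operator name and shape, and the simulator sums table lookups over a request's operator trace. All class, function, and operator names here are illustrative assumptions, not LLMServingSim2.0's actual API.

```python
# Hypothetical sketch of trace-driven, operator-level latency modeling.
# Names (OpKey, LatencyTable, simulate_request) are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class OpKey:
    """Identifies a profiled operator instance: name plus shape parameters."""
    name: str      # e.g. "matmul", "attention"
    batch: int
    seq_len: int


class LatencyTable:
    """Per-accelerator lookup table, filled once by an offline profiler run."""

    def __init__(self) -> None:
        self._table: dict[OpKey, float] = {}

    def record(self, key: OpKey, latency_us: float) -> None:
        """Store one profiled measurement (microseconds)."""
        self._table[key] = latency_us

    def lookup(self, key: OpKey) -> float:
        """Return the profiled latency, falling back to the nearest profiled
        batch size when the exact configuration was not measured (a common
        approximation choice in trace-driven simulators)."""
        if key in self._table:
            return self._table[key]
        candidates = [k for k in self._table
                      if k.name == key.name and k.seq_len == key.seq_len]
        if not candidates:
            raise KeyError(f"operator {key.name!r} was never profiled")
        nearest = min(candidates, key=lambda k: abs(k.batch - key.batch))
        return self._table[nearest]


def simulate_request(trace: list[OpKey], table: LatencyTable) -> float:
    """Model end-to-end latency as the sum of per-operator lookups."""
    return sum(table.lookup(op) for op in trace)
```

Under this framing, supporting a new accelerator only requires re-running the profiler to populate a new table; the simulator logic stays unchanged, which is what makes single-command hardware integration plausible.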