🤖 AI Summary
Existing LLM serving simulators suffer from two key limitations: the lack of a clear hardware abstraction, which makes integrating new hardware models non-trivial, and coverage of only a narrow subset of modern serving techniques. LLMServingSim2.0 addresses both by adopting trace-driven performance modeling with an operator-level latency profiler, enabling heterogeneous accelerators (e.g., GPUs, TPUs) to be integrated with a single command, and by embedding up-to-date serving techniques behind flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, the profiler requires 18.5× fewer lines of code than the predecessor's hardware-simulator integration while outperforming it, and the simulator reproduces GPU-based LLM serving with only 1.9% end-to-end latency error at practical simulation time, establishing a low-effort, high-fidelity evaluation platform for both hardware developers and LLM service providers.
📝 Abstract
This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer LoC than the predecessor's hardware-simulator integration while outperforming it, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
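To make the trace-driven approach concrete, the sketch below shows one plausible shape for operator-level latency modeling: an offline profiler fills a per-accelerator lookup table keyed by operator name and shape, and the simulator sums table lookups over a request's operator trace. All class, function, and operator names here are illustrative assumptions, not LLMServingSim2.0's actual API.

```python
# Hypothetical sketch of trace-driven, operator-level latency modeling.
# Names (OpKey, LatencyTable, simulate_request) are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class OpKey:
    """Identifies a profiled operator instance: name plus shape parameters."""
    name: str      # e.g. "matmul", "attention"
    batch: int
    seq_len: int


class LatencyTable:
    """Per-accelerator lookup table, filled once by an offline profiler run."""

    def __init__(self) -> None:
        self._table: dict[OpKey, float] = {}

    def record(self, key: OpKey, latency_us: float) -> None:
        """Store one profiled measurement (microseconds)."""
        self._table[key] = latency_us

    def lookup(self, key: OpKey) -> float:
        """Return the profiled latency, falling back to the nearest profiled
        batch size when the exact configuration was not measured (a common
        approximation choice in trace-driven simulators)."""
        if key in self._table:
            return self._table[key]
        candidates = [k for k in self._table
                      if k.name == key.name and k.seq_len == key.seq_len]
        if not candidates:
            raise KeyError(f"operator {key.name!r} was never profiled")
        nearest = min(candidates, key=lambda k: abs(k.batch - key.batch))
        return self._table[nearest]


def simulate_request(trace: list[OpKey], table: LatencyTable) -> float:
    """Model end-to-end latency as the sum of per-operator lookups."""
    return sum(table.lookup(op) for op in trace)
```

Under this framing, supporting a new accelerator only requires re-running the profiler to populate a new table; the simulator logic stays unchanged, which is what makes single-command hardware integration plausible.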