APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving

📅 2024-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Efficient large language model (LLM) serving faces a critical challenge in selecting an optimal parallelization strategy: it requires careful trade-offs among computation, memory, communication, and energy across data, pipeline, and tensor parallelism, further complicated by the divergent requirements of long-context prefill (compute-intensive) versus long-sequence generation (memory-intensive). This paper introduces APEX, a lightweight, CPU-based simulator that plans parallel execution for LLM serving. It features dynamism-aware modeling of iteration-level batching, coupled with design-space pruning that exploits the repetitive structure of transformer models and a modular abstraction of heterogeneous devices, enabling rapid exploration across multiple parallel paradigms, quantization formats, and trillion-parameter models. Experiments demonstrate: (1) execution plans up to 3.37× faster than those found by heuristic baselines; (2) up to 45% energy reduction versus latency-optimal configurations; and (3) plan identification within 15 minutes on a CPU, making APEX 71× faster and 1234× more cost-effective than cloud-based GPU deployment.

📝 Abstract
Efficiently serving Large Language Models (LLMs) requires selecting an optimal parallel execution plan, balancing computation, memory, and communication overhead. However, determining the best strategy is challenging due to varying parallelism techniques (data, pipeline, tensor) and workload characteristics (e.g., compute-intensive tasks with long prompts vs. memory-intensive tasks with long generation). We propose APEX, an LLM serving system simulator that efficiently identifies optimal parallel execution plans by considering key factors of LLM serving systems, such as memory usage and batching behavior. APEX performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to reduce design space, scaling efficiently to trillion-scale models. APEX abstracts the key components of LLM serving systems, including the model, batching module, quantization formats, and device clusters, enabling the simulator to be general and extensible. Simulating on a CPU, APEX evaluates execution plans for various device clusters, covering diverse LLMs and workloads. APEX finds plans up to 3.37x faster than heuristics, and also plans that reduce energy consumption by up to 45% compared to latency-optimal plans. APEX performs comprehensive evaluations, reporting key system metrics like time per output token and time to first token, which can help service providers meet SLOs. APEX identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment. APEX can be accessed at https://github.com/microsoft/apex_plus
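The iteration-level batching that the abstract describes can be sketched as a toy discrete-event loop: each iteration, finished requests leave the batch and queued ones join, so batch composition changes every step. All names and the linear cost model below are illustrative assumptions, far simpler than APEX's actual modeling of memory, communication, and devices:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_len: int              # tokens consumed at prefill
    gen_len: int                 # tokens to generate
    remaining: int = field(init=False)

    def __post_init__(self):
        self.remaining = self.gen_len


def simulate(requests, max_batch=8, ms_per_token=0.05, overhead_ms=2.0):
    """Toy simulator of iteration-level (continuous) batching.

    Each iteration admits queued requests up to max_batch, charges a
    batch-size-dependent step cost, and retires finished requests --
    the dynamism that a static, whole-batch model would miss.
    Returns total simulated latency in milliseconds.
    """
    queue = deque(requests)
    running = []
    t = 0.0
    while queue or running:
        # Admit new requests into the freed batch slots.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # Toy cost model: fixed per-iteration overhead plus a
        # per-token cost for one decoded token per running request.
        t += overhead_ms + ms_per_token * len(running)
        for r in running:
            r.remaining -= 1
        running = [r for r in running if r.remaining > 0]
    return t
```

For example, `simulate([Request(10, 4), Request(10, 2)])` runs four iterations with batch sizes 2, 2, 1, 1 under this toy cost model.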
Problem

Research questions and friction points this paper is trying to address.

Optimizing parallel execution plans for efficient LLM serving
Balancing computation, memory, and communication overhead in LLMs
Identifying dynamism-aware strategies for diverse workloads and models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamism-aware simulation for optimal parallel execution
Abstracts key LLM serving components for extensibility
Efficiently scales to trillion-scale models
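To give a feel for the design space being pruned, the sketch below enumerates candidate (data, tensor, pipeline) parallel degrees for a cluster. The function name and the divisibility constraint are illustrative assumptions, not APEX's actual pruning algorithm, which additionally exploits the fact that identical transformer layers need only be simulated once:

```python
from itertools import product


def candidate_plans(n_devices, n_layers):
    """Enumerate (dp, tp, pp) parallel plans for a cluster.

    A plan is feasible here if the three degrees multiply to the
    device count and the pipeline depth evenly divides the layer
    count, so every pipeline stage holds the same repeated block.
    """
    plans = []
    for dp, tp, pp in product(range(1, n_devices + 1), repeat=3):
        if dp * tp * pp == n_devices and n_layers % pp == 0:
            plans.append((dp, tp, pp))
    return plans
```

Even this toy enumeration yields ten plans for 8 devices and 32 layers; a real planner must also score each plan's compute, memory, and communication cost, which is where simulation speed matters.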
Yi-Chien Lin
The Ohio State University
Computational Linguistics, Psycholinguistics, Natural Language Processing
Woosuk Kwon
PhD student, UC Berkeley
Machine Learning, Systems
Ronald Pineda
University of California, Los Angeles, Los Angeles, California, USA
Fanny Nina Paravecino
Microsoft, Mountain View, California, USA