🤖 AI Summary
Efficient large language model (LLM) serving hinges on selecting an optimal parallelization strategy, which requires careful trade-offs among computation, memory, communication, and energy across data, pipeline, and tensor parallelism, and is further complicated by the divergent demands of long-context inference versus long-sequence generation. This paper introduces APEX, a lightweight, CPU-native simulator that plans parallel execution for LLM serving. It combines a dynamism-aware, iteration-level batching model with structure-repetition-driven design-space pruning and modular abstractions of models, quantization formats, and heterogeneous device clusters, enabling rapid exploration across multiple parallel paradigms and trillion-parameter models. Experiments show that APEX (1) finds execution plans that are up to 3.37x faster than those chosen by heuristic baselines; (2) finds plans that reduce energy consumption by up to 45% relative to latency-optimal configurations; and (3) identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than evaluating plans on cloud GPU deployments.
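The structure-repetition-driven pruning idea can be illustrated with a minimal sketch: because a decoder-only LLM stacks many identical transformer layers, a simulator can model a single layer and scale by depth instead of simulating every layer. This is an illustrative cost model, not APEX's actual implementation; the FLOP formulas and parameter names (`hidden`, `seq`, `ffn_mult`) are standard textbook estimates assumed here for demonstration.

```python
# Illustrative sketch of structure-repetition-driven pruning (NOT APEX's code):
# estimate whole-model compute from ONE repeated transformer layer.

def layer_flops(hidden: int, seq: int, ffn_mult: int = 4) -> int:
    """Rough FLOP count for one decoder layer over `seq` tokens."""
    attn_proj = 4 * seq * hidden * hidden        # Q, K, V, O projections
    attn_scores = 2 * seq * seq * hidden         # QK^T and attention-weighted V
    ffn = 2 * seq * hidden * (ffn_mult * hidden) # two FFN matmuls
    return 2 * (attn_proj + attn_scores + ffn)   # 1 multiply-accumulate = 2 FLOPs

def model_flops(n_layers: int, hidden: int, seq: int) -> int:
    # Evaluate the layer cost once and multiply by depth: the design space
    # a planner must simulate no longer grows with the number of layers.
    return n_layers * layer_flops(hidden, seq)
```

Under this assumption, scaling from a small model to a trillion-parameter one only changes the multipliers, not the amount of per-configuration simulation work.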
📝 Abstract
Efficiently serving Large Language Models (LLMs) requires selecting an optimal parallel execution plan that balances computation, memory, and communication overhead. Determining the best strategy is challenging, however, due to the variety of parallelism techniques (data, pipeline, tensor) and workload characteristics (e.g., compute-intensive tasks with long prompts vs. memory-intensive tasks with long generation). We propose APEX, an LLM serving system simulator that efficiently identifies optimal parallel execution plans by modeling key factors of LLM serving systems, such as memory usage and batching behavior. APEX performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to prune the design space, scaling efficiently to trillion-scale models. APEX abstracts the key components of LLM serving systems, including the model, batching module, quantization formats, and device clusters, making the simulator general and extensible. Running entirely on a CPU, APEX evaluates execution plans for various device clusters, covering diverse LLMs and workloads. APEX finds plans up to 3.37x faster than heuristics, as well as plans that reduce energy consumption by up to 45% compared to latency-optimal plans. Its evaluations report key system metrics, such as time per output token and time to first token, which can help service providers meet SLOs. APEX identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment. APEX can be accessed at https://github.com/microsoft/apex_plus
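The iteration-level batching that APEX simulates can be sketched in a few lines: finished requests leave the batch and waiting requests join between decode iterations, so batch size varies over time. This is a hedged, minimal sketch under assumed names (`Request`, `simulate`, `cost_fn`); a real simulator like APEX would replace the placeholder `cost_fn` with calibrated compute, memory, and communication models per device cluster.

```python
# Minimal sketch of dynamism-aware, iteration-level (continuous) batching.
# Not APEX's implementation; names and the cost model are illustrative.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    gen_len: int        # total tokens this request will generate
    generated: int = 0  # tokens generated so far

def simulate(requests, max_batch, cost_fn):
    """Return (per-iteration batch sizes, total simulated cost).

    cost_fn(batch) -> estimated latency of one decode iteration over `batch`.
    """
    waiting = deque(requests)
    running: list[Request] = []
    total_cost, batch_trace = 0.0, []
    while waiting or running:
        # Admit new requests at iteration granularity, not per full batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        total_cost += cost_fn(running)
        batch_trace.append(len(running))
        for r in running:
            r.generated += 1
        # Completed requests retire immediately, freeing batch slots.
        running = [r for r in running if r.generated < r.gen_len]
    return batch_trace, total_cost
```

For example, with three requests generating 2, 3, and 1 tokens and `max_batch=2`, the third request is admitted as soon as the first finishes, keeping the batch full; a simulator sums `cost_fn` across iterations to score one candidate execution plan.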