🤖 AI Summary
This study addresses the heterogeneous inter-GPU communication patterns induced by tensor parallelism, pipeline parallelism, and their hybrid variants in distributed large language model (LLM) inference—patterns that critically impact end-to-end latency, network overhead, and service-level objective (SLO) compliance.
Method: We propose an integrated methodology that couples measurement-driven analysis, analytical modeling, and experimental validation, establishing a fine-grained framework for characterizing communication behavior that quantifies communication latency and bandwidth consumption across varying sequence lengths, model scales, and hardware topologies.
Contribution/Results: We identify three key trade-offs: (i) tensor parallelism enables low-latency short-sequence responses but causes high network saturation; (ii) pipeline parallelism reduces per-transfer volume yet introduces substantial bubble latency; and (iii) hybrid parallelism requires dynamic, load-aware tuning of communication-computation overlap. Our findings provide interpretable, reusable theoretical foundations and practical guidelines for parallelism selection, communication optimization, and SLO-aware deployment in production-grade LLM inference systems.
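The TP-versus-PP trade-off above can be illustrated with a simple α-β (latency-bandwidth) cost model: tensor parallelism pays for collective all-reduces on every layer, while pipeline parallelism pays only for point-to-point activation transfers at stage boundaries plus schedule bubbles. The sketch below is illustrative only; the parameter values (hidden size, layer count, link latency, bandwidth) are assumptions for demonstration, not measurements from this study.

```python
# Toy alpha-beta model contrasting per-forward-pass communication cost of
# tensor parallelism (TP) and pipeline parallelism (PP).
# All numeric parameters below are illustrative assumptions.

BYTES = 2  # fp16 activations


def allreduce_time(size_bytes, gpus, alpha, beta):
    # Ring all-reduce: 2*(p-1) steps, each moving ~size/p bytes per link.
    return 2 * (gpus - 1) * (alpha + (size_bytes / gpus) * beta)


def tp_comm_per_layer(seq_len, hidden, gpus, alpha, beta):
    # TP issues two all-reduces per transformer layer (attention + MLP).
    activation = seq_len * hidden * BYTES
    return 2 * allreduce_time(activation, gpus, alpha, beta)


def pp_comm_per_boundary(seq_len, hidden, alpha, beta):
    # PP sends one activation tensor point-to-point per stage boundary.
    activation = seq_len * hidden * BYTES
    return alpha + activation * beta


def pp_bubble_fraction(stages, microbatches):
    # Idle ("bubble") share of a GPipe-style schedule.
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    alpha = 5e-6          # 5 us per-message link latency (assumed)
    beta = 1 / 200e9      # 200 GB/s effective link bandwidth (assumed)
    seq, hidden, layers, p = 2048, 8192, 80, 8

    tp_total = layers * tp_comm_per_layer(seq, hidden, p, alpha, beta)
    pp_total = (p - 1) * pp_comm_per_boundary(seq, hidden, alpha, beta)

    print(f"TP comm per forward pass: {tp_total * 1e3:.2f} ms")
    print(f"PP comm per forward pass: {pp_total * 1e3:.2f} ms")
    print(f"PP bubble fraction (m=4): {pp_bubble_fraction(p, 4):.2f}")
```

Under these assumed numbers, TP's collective traffic dominates raw communication time while PP's point-to-point transfers are cheap but come with a large bubble fraction, mirroring trade-offs (i) and (ii) above.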
📝 Abstract
Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks make deploying these models practical, inter-GPU communication imposes significant performance constraints that limit service quality in real-world systems. This paper investigates communication dynamics in distributed LLM serving, analyzing how various parallelization approaches coordinate data exchange between GPU workers during inference. We study dense transformer-based models as representative examples of contemporary architectures widely used in operational deployments. Our work combines detailed profiling measurements with predictive analytical models to characterize communication behavior across parallelization configurations. Results show that tensor parallelism incurs substantial network overhead but delivers superior response times for short sequences, pipeline parallelism minimizes data-transfer requirements while increasing total latency, and combined approaches demand careful tuning to achieve balanced performance. These insights offer practical recommendations for selecting parallelization schemes in production LLM services and identify key opportunities for optimizing inference frameworks and communication infrastructure.