ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from high time-to-first-token (TTFT) latency and low resource utilization, primarily due to modality heterogeneity, tightly coupled architectures, and inflexible parallelization strategies ill-suited for dynamic workloads. To address these challenges, we propose an elastic multimodal parallelism mechanism. Our approach features: (1) modality-aware load balancing and decoupled inference scheduling; (2) dynamic resource allocation with cross-stage elastic parallelism; and (3) a unified multimodal prefix cache with non-blocking encoding. Extensive experiments demonstrate that our system reduces TTFT by up to 4.2× and improves throughput by 3.2–4.5× over state-of-the-art MLLM serving systems, while consistently meeting stringent service-level objectives (SLOs).

📝 Abstract
Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and heterogeneous workloads -- introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we propose Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
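The unified multimodal prefix cache described above can be pictured with a small sketch (illustrative only, not the authors' implementation): encoded features for a request prefix, covering both text tokens and raw image bytes, are keyed by a content hash so that repeated prefixes skip re-encoding. The class name, `encode_fn` callback, and hashing scheme below are assumptions made for illustration.

```python
import hashlib

class MultimodalPrefixCache:
    """Toy unified cache: one key space covers text + image prefixes."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(text_prefix: str, image_bytes: bytes = b"") -> str:
        # Hash text and image content together so identical multimodal
        # prefixes map to the same cache entry regardless of request order.
        h = hashlib.sha256()
        h.update(text_prefix.encode("utf-8"))
        h.update(image_bytes)
        return h.hexdigest()

    def get_or_encode(self, text_prefix, image_bytes, encode_fn):
        # On a miss, run the (expensive) encoder once; later requests with
        # the same prefix reuse the stored features.
        k = self.key(text_prefix, image_bytes)
        if k not in self._store:
            self._store[k] = encode_fn(text_prefix, image_bytes)
        return self._store[k]
```

In a real serving system the cached value would be the projected feature tensors (and possibly KV-cache blocks), and eviction policy would matter; this sketch only shows the cache-keying idea.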
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving multimodal LLMs with heterogeneous workloads
Reducing inference overhead from complex pipelines and components
Adapting parallelism strategies for diverse request types and stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Multimodal Parallelism for resource adaptation
Modality-aware load balancer for dynamic allocation
Unified multimodal prefix caching for inference efficiency
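The modality-aware load balancing idea in the bullets above can be sketched as follows (a minimal toy, assuming a proportional-share policy; class and method names are hypothetical): requests are split into modality groups, and each group's GPU pool is resized against its measured load, rather than serving all modalities through one coupled pipeline.

```python
from collections import defaultdict

class ModalityAwareBalancer:
    """Toy balancer: split a fixed GPU budget across modality groups
    in proportion to each group's outstanding estimated compute cost."""

    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
        self.load = defaultdict(float)  # modality -> accumulated cost

    def submit(self, modality: str, cost: float) -> None:
        # Record a request; `cost` is an estimated compute cost
        # (e.g. token count for text, patch count for images).
        self.load[modality] += cost

    def allocation(self) -> dict:
        # Proportional split of GPUs, guaranteeing each active
        # modality group at least one GPU.
        total = sum(self.load.values()) or 1.0
        return {m: max(1, round(self.total_gpus * w / total))
                for m, w in self.load.items()}
```

The actual system additionally decouples inference stages and adjusts parallelism within each group; this sketch covers only the group-level resource split.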