Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
The three-stage pipeline of multimodal large language models (MLLMs)—comprising video decoding, visual encoding, and LLM inference—suffers from severe bottlenecks: CPU-bound video decoding dominates time-to-first-token (TTFT) and limits throughput; disjoint batching between the visual encoder and LLM causes cross-stage blocking and resource underutilization. Method: We propose FlashCodec and UnifiedServe—a co-optimized framework wherein FlashCodec enables multi-GPU collaborative video decoding, while UnifiedServe achieves logical decoupling and physical resource sharing to support dynamic scheduling and memory-compute reuse across visual encoding and LLM prefilling/decoding. Contribution/Results: Our approach breaks rigid stage isolation for the first time, enabling low-interference end-to-end pipelined parallelism. Experiments show 3.0× higher request throughput, 1.5× improvement in SLO compliance rate, and 4.4× higher overall throughput over state-of-the-art systems.
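The paper does not publish FlashCodec's decoding algorithm here, but the idea of collaborative multi-GPU video decoding can be illustrated with a toy sketch: split a video into independently decodable group-of-pictures (GOP) chunks, decode the chunks in parallel across workers, and reassemble frames in presentation order. All names (`split_by_gop`, `collaborative_decode`, the thread-based "workers") are illustrative assumptions, not the paper's API; threads stand in for GPUs.

```python
from concurrent.futures import ThreadPoolExecutor

def split_by_gop(gop_starts, num_frames, num_workers):
    """Partition a frame range into GOP-aligned chunks, round-robined
    across workers. Chunks never cut across a GOP boundary, so each
    worker can decode its share without frames held by another worker."""
    bounds = list(gop_starts) + [num_frames]
    chunks = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    shares = [[] for _ in range(num_workers)]
    for i, chunk in enumerate(chunks):
        shares[i % num_workers].append(chunk)  # balance GOPs across workers
    return shares

def decode_share(worker_id, share):
    # Stand-in for per-GPU hardware decoding: emit (frame_index, payload).
    return [(f, f"frame-{f}") for start, end in share for f in range(start, end)]

def collaborative_decode(gop_starts, num_frames, num_workers):
    shares = split_by_gop(gop_starts, num_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(decode_share, range(num_workers), shares)
    # Restore presentation order after parallel, out-of-order decoding.
    frames = sorted(f for part in parts for f in part)
    return [payload for _, payload in frames]

frames = collaborative_decode(gop_starts=[0, 8, 16, 24], num_frames=30, num_workers=2)
```

Because every worker starts at a keyframe, no worker waits on another's output, which is what lets latency drop with more GPUs instead of only aggregate throughput rising.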

📝 Abstract
Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant system bottlenecks. First, multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while existing GPU-based approaches prioritize throughput-oriented parallelism and fail to meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize available compute and memory resources, reducing overall utilization and constraining system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while preserving high throughput. UnifiedServe optimizes the vision-to-text and inference stages by logically decoupling their execution to eliminate inter-stage blocking while physically sharing GPU resources to maximize system utilization, carefully orchestrating execution across stages to minimize interference. Together, the two designs form an end-to-end optimized stack that can serve up to 3.0× more requests or enforce 1.5× tighter SLOs, while achieving up to 4.4× higher throughput compared to state-of-the-art systems.
Problem

Research questions and friction points this paper is trying to address.

Accelerate multimodal preprocessing to reduce first-token latency
Eliminate inter-stage blocking between vision encoding and LLM inference
Maximize GPU utilization across heterogeneous MLLM pipeline stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-GPU collaborative decoding reduces video preprocessing latency
Logical decoupling and physical resource sharing eliminate inter-stage blocking
End-to-end orchestration maximizes GPU utilization and throughput
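The "logical decoupling with physical resource sharing" idea above can be sketched with a toy scheduler: encode and decode work items live in separate queues (logical decoupling), but every time slice is issued against one shared device budget (physical sharing), with latency-critical decode steps served first. The class name `UnifiedScheduler`, the queue discipline, and the work-unit budget are all illustrative assumptions, not UnifiedServe's actual policy.

```python
from collections import deque

class UnifiedScheduler:
    """Toy model of logical decoupling + physical sharing (illustrative only).

    Vision-encode chunks and LLM-decode steps sit in separate queues, so
    neither stage's batching constraints block the other. Each GPU time
    slice drains pending decode steps first (protecting per-token latency)
    and backfills the remaining budget with encode chunks, so the shared
    GPU stays busy without a long encode batch stalling token generation."""

    def __init__(self, slice_budget):
        self.slice_budget = slice_budget  # work units per shared GPU slice
        self.encode_q = deque()           # vision-encoder chunks
        self.decode_q = deque()           # per-token LLM decode steps

    def submit_encode(self, chunk):
        self.encode_q.append(("encode", chunk))

    def submit_decode(self, step):
        self.decode_q.append(("decode", step))

    def run_slice(self):
        issued, budget = [], self.slice_budget
        while budget and self.decode_q:   # decode first: latency-critical
            issued.append(self.decode_q.popleft())
            budget -= 1
        while budget and self.encode_q:   # backfill idle budget with encode
            issued.append(self.encode_q.popleft())
            budget -= 1
        return issued

sched = UnifiedScheduler(slice_budget=4)
for c in ["e0", "e1", "e2"]:
    sched.submit_encode(c)
for s in ["d0", "d1"]:
    sched.submit_decode(s)
plan = sched.run_slice()  # decode steps first, then encode backfill
```

In a real system the "budget" would be SM or memory-bandwidth headroom rather than abstract work units, but the scheduling invariant is the same: one physical device, two logically independent queues.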