vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing model serving systems struggle to efficiently support multimodal models with arbitrary input-output configurations: their heterogeneous components force developers to manually handle cross-stage interactions, which severely limits performance. This work proposes a fully disaggregated serving architecture that automatically decomposes multimodal models into interconnected stages through a novel staged graph abstraction. By integrating independently optimized execution backends, stage-wise batching, flexible GPU allocation, and a unified cross-stage data connector, the system enables efficient coordination and automatic scheduling across heterogeneous engines such as large language models and diffusion models. Experimental results demonstrate that the proposed approach reduces end-to-end task completion time by up to 91.4%, significantly improving throughput and resource utilization.
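The staged graph abstraction described above can be sketched as a small data structure. This is an illustrative example only, with hypothetical names (`Stage`, `StageGraph`), not vLLM-Omni's actual API: each stage wraps one engine type with its own GPU allocation, and edges record which stage feeds which.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One serving stage; names and fields are illustrative assumptions."""
    name: str
    engine: str          # e.g. "llm" or "diffusion"
    gpus: int            # GPUs allocated to this stage independently

@dataclass
class StageGraph:
    """A multimodal pipeline decomposed into stages connected as a graph."""
    stages: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src, dst) pairs

    def add_stage(self, stage: Stage) -> None:
        self.stages[stage.name] = stage

    def connect(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

    def downstream(self, name: str) -> list:
        # Stages that consume this stage's output via the data connector.
        return [d for s, d in self.edges if s == name]

# A text-to-image pipeline decomposed into an LLM stage and a diffusion stage:
g = StageGraph()
g.add_stage(Stage("llm_encoder", engine="llm", gpus=2))
g.add_stage(Stage("dit_decoder", engine="diffusion", gpus=4))
g.connect("llm_encoder", "dit_decoder")
```

Decoupling the graph description from the engines is what lets each stage be batched, scheduled, and scaled on its own GPUs.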

πŸ“ Abstract
Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to severe performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is publicly available at https://github.com/vllm-project/vllm-omni.
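Per-stage batching and inter-stage connectors, as the abstract describes them, can be sketched with per-stage queues. This is a hedged, minimal sketch under assumed names (`step_stage` is not part of vLLM-Omni): each stage drains up to `max_batch` requests from its own queue, runs them as one batch, and a connector forwards the outputs to the downstream stage's queue.

```python
from collections import deque

def step_stage(in_q, run_batch, out_q, max_batch=4):
    """Run one batching step for a stage (illustrative sketch).

    Drains up to max_batch requests, executes them together, and routes
    results to the next stage's queue; returns the batch size processed.
    """
    batch = []
    while in_q and len(batch) < max_batch:
        batch.append(in_q.popleft())
    if not batch:
        return 0
    for out in run_batch(batch):
        if out_q is not None:
            out_q.append(out)
    return len(batch)

# Two chained stages: an LLM-like stage feeding a diffusion-like stage.
llm_q = deque(["prompt-a", "prompt-b", "prompt-c"])
dit_q = deque()
step_stage(llm_q, lambda b: [f"tokens({r})" for r in b], dit_q)
# dit_q now holds the LLM stage's outputs, queued for the diffusion stage.
```

Because each stage batches from its own queue, a slow diffusion stage does not block the LLM stage from admitting and batching new requests.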
Problem

Research questions and friction points this paper is trying to address.

Keywords: any-to-any multimodal models, model serving, disaggregated serving, multimodal AI, stage execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Keywords: disaggregated serving, any-to-any multimodal models, stage abstraction, multimodal inference, resource optimization
πŸ‘₯ Authors
Peiqi Yin, The Chinese University of Hong Kong (Serving System, Machine Learning System, CXL)
Jiangyun Zhu, Institute of Software, Chinese Academy of Sciences
Han Gao, Distributed and Parallel Software Lab, Huawei (Generative AI)
Chenguang Zheng, AI Framework and Data Technology Lab, Huawei
Yongxiang Huang, AI Framework and Data Technology Lab, Huawei
Taichang Zhou, AI Framework and Data Technology Lab, Huawei
Ruirui Yang, AI Framework and Data Technology Lab, Huawei
Weizhi Liu, East China Normal University (AIGC Security, Generative Watermarking)
Weiqing Chen, AI Framework and Data Technology Lab, Huawei
Canlin Guo, AI Framework and Data Technology Lab, Huawei
Didan Deng, The Hong Kong University of Science and Technology (Deep Learning)
Zifeng Mo, Sun Yat-sen University
Cong Wang, Huawei Technologies Co., Ltd
James Cheng, The Chinese University of Hong Kong
Roger Wang, Roblox
Hongsheng Liu, AI Framework and Data Technology Lab, Huawei