🤖 AI Summary
Existing model serving systems struggle to efficiently support multimodal models with arbitrary input-output configurations due to heterogeneous components that necessitate manual handling of cross-stage interactions and severely limit performance. This work proposes a fully decoupled serving architecture that automatically decomposes multimodal models into interconnected stages through a novel staged graph abstraction. By integrating independently optimized execution backends, stage-wise batching, flexible GPU allocation, and a unified cross-stage data connector, the system enables efficient coordination and automatic scheduling across heterogeneous engines such as large language models and diffusion models. Experimental results demonstrate that the proposed approach reduces end-to-end task completion time by up to 91.4%, significantly improving throughput and resource utilization.
📄 Abstract
Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation, and lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to severe performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is publicly available at https://github.com/vllm-project/vllm-omni.
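The staged-graph abstraction described above can be illustrated with a minimal sketch: stages form a directed acyclic graph, each stage is served by its own engine, and connectors route each stage's output to its successors in topological order. All names and the API below are invented for illustration and are not vLLM-Omni's actual interface; real stages would wrap LLM or diffusion engines rather than plain functions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Stage:
    """One pipeline stage (e.g. an LLM or diffusion component)."""
    name: str
    engine: str                      # hypothetical label: "llm" or "diffusion"
    run: Callable[[Any], Any]        # stand-in for the stage's engine call

class StageGraph:
    """Toy DAG of stages; edges act as inter-stage data connectors."""
    def __init__(self) -> None:
        self.stages: Dict[str, Stage] = {}
        self.edges: Dict[str, List[str]] = {}
        self.indegree: Dict[str, int] = {}

    def add_stage(self, stage: Stage) -> None:
        self.stages[stage.name] = stage
        self.edges.setdefault(stage.name, [])
        self.indegree.setdefault(stage.name, 0)

    def connect(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)
        self.indegree[dst] += 1

    def execute(self, request: Any) -> Dict[str, Any]:
        # Kahn-style topological execution: a stage runs once all of its
        # predecessors have produced output; source stages see the raw request.
        indeg = dict(self.indegree)
        preds: Dict[str, List[str]] = {n: [] for n in self.stages}
        for src, dsts in self.edges.items():
            for d in dsts:
                preds[d].append(src)
        ready = deque(n for n, d in indeg.items() if d == 0)
        outputs: Dict[str, Any] = {}
        while ready:
            name = ready.popleft()
            inp = request if not preds[name] else [outputs[p] for p in preds[name]]
            outputs[name] = self.stages[name].run(inp)
            for d in self.edges[name]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    ready.append(d)
        return outputs

# Hypothetical text-to-image pipeline: an LLM stage feeds a diffusion stage.
g = StageGraph()
g.add_stage(Stage("llm", "llm", lambda req: f"caption:{req}"))
g.add_stage(Stage("dit", "diffusion", lambda ins: f"image[{ins[0]}]"))
g.connect("llm", "dit")
result = g.execute("a red fox")
# result["dit"] → "image[caption:a red fox]"
```

Because each stage is an independent node, a scheduler built on such a graph could batch requests per stage and assign GPUs per stage rather than per monolithic model, which is the disaggregation the abstract describes.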