🤖 AI Summary
Existing model serving systems struggle to efficiently support multimodal models with arbitrary input-output configurations due to heterogeneous components that necessitate manual handling of cross-stage interactions and severely limit performance. This work proposes a fully decoupled serving architecture that automatically decomposes multimodal models into interconnected stages through a novel staged graph abstraction. By integrating independently optimized execution backends, stage-wise batching, flexible GPU allocation, and a unified cross-stage data connector, the system enables efficient coordination and automatic scheduling across heterogeneous engines such as large language models and diffusion models. Experimental results demonstrate that the proposed approach reduces end-to-end task completion time by up to 91.4%, significantly improving throughput and resource utilization.
📄 Abstract
Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation, and lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to severe performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is publicly available at https://github.com/vllm-project/vllm-omni.
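The staged-graph abstraction described above can be illustrated with a minimal sketch: stages form a directed acyclic graph, each stage is served by its own engine, and connectors route each stage's output to its successors in topological order. All names and the API below are invented for illustration and are not vLLM-Omni's actual interface; real stages would wrap LLM or diffusion engines rather than plain functions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Stage:
    """One pipeline stage (e.g. an LLM or diffusion component)."""
    name: str
    engine: str                      # hypothetical label: "llm" or "diffusion"
    run: Callable[[Any], Any]        # stand-in for the stage's engine call

class StageGraph:
    """Toy DAG of stages; edges act as inter-stage data connectors."""
    def __init__(self) -> None:
        self.stages: Dict[str, Stage] = {}
        self.edges: Dict[str, List[str]] = {}
        self.indegree: Dict[str, int] = {}

    def add_stage(self, stage: Stage) -> None:
        self.stages[stage.name] = stage
        self.edges.setdefault(stage.name, [])
        self.indegree.setdefault(stage.name, 0)

    def connect(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)
        self.indegree[dst] += 1

    def execute(self, request: Any) -> Dict[str, Any]:
        # Kahn-style topological execution: a stage runs once all of its
        # predecessors have produced output; source stages see the raw request.
        indeg = dict(self.indegree)
        preds: Dict[str, List[str]] = {n: [] for n in self.stages}
        for src, dsts in self.edges.items():
            for d in dsts:
                preds[d].append(src)
        ready = deque(n for n, d in indeg.items() if d == 0)
        outputs: Dict[str, Any] = {}
        while ready:
            name = ready.popleft()
            inp = request if not preds[name] else [outputs[p] for p in preds[name]]
            outputs[name] = self.stages[name].run(inp)
            for d in self.edges[name]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    ready.append(d)
        return outputs

# Hypothetical text-to-image pipeline: an LLM stage feeds a diffusion stage.
g = StageGraph()
g.add_stage(Stage("llm", "llm", lambda req: f"caption:{req}"))
g.add_stage(Stage("dit", "diffusion", lambda ins: f"image[{ins[0]}]"))
g.connect("llm", "dit")
result = g.execute("a red fox")
# result["dit"] → "image[caption:a red fox]"
```

Because each stage is an independent node, a scheduler built on such a graph could batch requests per stage and assign GPUs per stage rather than per monolithic model, which is the disaggregation the abstract describes.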