StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time, large-scale multimodal generation faces significant challenges, including high latency, substantial cost, and complex resource scheduling. This work addresses these issues in the context of real-time podcast video generation by proposing the first adaptive serving system for multimodal workflows. The system dynamically adjusts generation quality, co-schedules heterogeneous hardware resources, and combines content and model parallelism to trade off latency, cost, and output quality efficiently. By orchestrating large language models (LLMs), text-to-speech (TTS), and audio-visual generation models, the cheapest configuration produces a 10-minute video for under $25 on an A100 GPU (8.4× slower than real time), while a high-quality configuration achieves near-real-time streaming with sub-second startup latency for under $45.

📝 Abstract
Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4× slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under $45.
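The abstract's example of lowering video resolution for early scenes under a tight deadline can be sketched as a simple quality controller. This is an illustrative sketch, not the paper's implementation: the tier names, the linear time model (`sec_per_out_sec`), and the assumption that content parallelism divides wall-clock time evenly across GPUs are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    height: int             # output resolution in pixels (hypothetical tiers)
    sec_per_out_sec: float  # estimated GPU-seconds per second of output video

# Ordered best-first; the controller degrades quality only when needed.
TIERS = [
    Tier("high", 1080, 14.0),
    Tier("medium", 720, 6.0),
    Tier("low", 480, 2.5),
]

def pick_tier(scene_len_s: float, deadline_s: float, gpus: int) -> Tier:
    """Choose the highest quality tier whose estimated wall-clock time,
    split across `gpus` via content parallelism, fits the deadline."""
    for tier in TIERS:
        est_wall_clock = scene_len_s * tier.sec_per_out_sec / gpus
        if est_wall_clock <= deadline_s:
            return tier
    # Nothing fits: fall back to the lowest tier rather than miss the SLO badly.
    return TIERS[-1]
```

Under this model, an early scene with a tight streaming deadline gets a low tier (or extra GPUs), while later scenes, whose deadlines are farther out, can be rendered at full quality.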
Problem

Research questions and friction points this paper is trying to address.

multi-modal generation
real-time serving
latency constraints
resource efficiency
scalable inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal generation
real-time serving
adaptive quality control
resource-aware scheduling
heterogeneous hardware