🤖 AI Summary
To address the challenge of processing multimodal streaming queries under strict latency and high-throughput constraints, this paper introduces the first end-to-end stream processing system centered on multimodal large language models (MMLMs) as core operators. We propose a novel three-layer joint optimization framework—spanning logical, physical, and semantic levels—that integrates query rewriting, cross-modal operator pushdown, and semantic compression to significantly reduce inference overhead while preserving model accuracy. Our prototype system, system{}, demonstrates over 10× higher throughput and one-order-of-magnitude lower end-to-end latency compared to state-of-the-art approaches. This work pioneers the deep integration of MMLMs into stream processing architectures, establishing a systematic foundation for scalable, low-latency multimodal real-time analytics and charting a new research paradigm for multimodal stream systems.
📝 Abstract
In this paper, we present a vision for a new generation of multimodal streaming systems that embed MLLMs as first-class operators, enabling real-time query processing across multiple modalities. Achieving this is non-trivial: while recent work has integrated MLLMs into databases for multimodal queries, streaming systems require fundamentally different approaches due to their strict latency and throughput requirements. Our approach proposes novel optimizations at all levels, including logical, physical, and semantic query transformations that reduce model load to improve throughput while preserving accuracy. We demonstrate this with system{}, a prototype leveraging such optimizations to improve performance by more than an order of magnitude. Moreover, we discuss a research roadmap that outlines open research challenges for building a scalable and efficient multimodal stream processing systems.