AI Summary
Existing online Video-LLMs struggle to simultaneously achieve low-latency response and narrative coherence, particularly under continuous video stream input, where temporal misalignment and semantic discontinuity frequently occur. To address this, we propose an adaptive streaming decoding framework with three key innovations: (1) incremental video-language alignment; (2) single-forward-pass decisions on when to respond versus remain silent; and (3) a peak-end memory compression mechanism combined with streaming KV caching for memory-aware acceleration. The framework enables low-latency online understanding of videos exceeding 10 minutes. Evaluated on three benchmarks, it achieves an average 19.5% improvement in semantic correctness and reduces response timing deviation by 18.1%. On OmniStar's five tasks, it improves FPS by 12.0%, significantly outperforming state-of-the-art methods.
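To make the response-silence mechanism concrete, below is a minimal per-frame decision loop. It is a sketch under assumptions, not the released implementation: `silence_probability`, `stream_step`, and the novelty-based scorer are hypothetical stand-ins for the single forward pass that, in the real system, appends the new frame's tokens to the streaming KV cache and reads the probability of a dedicated silence token.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for one forward pass of the Video-LLM: in the real system
# this would read the probability of a <silence> token from the streaming KV cache;
# here a toy frame-novelty score keeps the loop executable.
def silence_probability(frame: torch.Tensor, last_frame: torch.Tensor | None) -> float:
    if last_frame is None:
        return 0.0  # first frame: always worth describing
    novelty = 1.0 - F.cosine_similarity(frame, last_frame, dim=0).item()
    return float(max(0.0, 1.0 - novelty))  # little visual change -> likely silence

def stream_step(frame: torch.Tensor, state: dict, threshold: float = 0.5) -> str:
    """Decide 'respond' vs. 'silent' for one incoming frame with a single scoring call."""
    p_sil = silence_probability(frame, state.get("last_frame"))
    state["last_frame"] = frame  # frame features always enter the streaming state
    return "silent" if p_sil >= threshold else "respond"

if __name__ == "__main__":
    state, frame = {}, torch.randn(16)
    for t in range(6):
        frame = -frame if t == 3 else frame + 0.01 * torch.randn(16)  # t=3: abrupt scene change
        print(t, stream_step(frame, state))
```

In this toy run the assistant responds at t=0 and again at the simulated scene change (t=3), and stays silent while consecutive frames are nearly identical, which is the behavior the single-forward-pass verification is meant to produce.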
Abstract
Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via single-forward-pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with a streaming key-value cache to achieve 1.53x faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness and an 18.1% reduction in timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
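As a rough illustration of the peak-end memory compression idea, the sketch below bounds a streaming frame memory by keeping the most salient frames ("peaks") plus the most recent ones ("end") and evicting the rest once a budget is exceeded. All names (`PeakEndMemory`, `FrameEntry`, the budget values) are hypothetical assumptions; the actual mechanism compresses cached key-value tensors rather than this simplified frame list.

```python
import heapq
import random
from dataclasses import dataclass, field

@dataclass
class FrameEntry:
    t: int            # frame index in the stream
    salience: float   # e.g. a change/novelty score from the vision encoder (assumed)
    kv: object        # placeholder for this frame's cached key/value tensors

@dataclass
class PeakEndMemory:
    budget: int = 64                              # max frames kept in memory
    n_end: int = 16                               # most recent frames always preserved
    frames: list = field(default_factory=list)

    def add(self, entry: FrameEntry) -> None:
        self.frames.append(entry)
        if len(self.frames) > self.budget:
            self._compress()

    def _compress(self) -> None:
        end = self.frames[-self.n_end:]                            # the "end" segment
        middle = self.frames[:-self.n_end]
        peaks = heapq.nlargest(self.budget - self.n_end,           # the "peaks"
                               middle, key=lambda e: e.salience)
        self.frames = sorted(peaks + end, key=lambda e: e.t)       # restore temporal order

if __name__ == "__main__":
    mem = PeakEndMemory(budget=8, n_end=3)
    for t in range(30):
        mem.add(FrameEntry(t=t, salience=random.random(), kv=None))
    print([e.t for e in mem.frames])  # a few high-salience earlier frames + the last 3
```

Keeping the memory at a fixed budget is what lets the streaming KV cache stay bounded over 10+ minute videos; the "end" slots preserve recency for timely responses while the "peak" slots retain the most informative earlier moments for narrative coherence.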