LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

📅 2025-11-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing online Video-LLMs struggle to simultaneously achieve low-latency response and narrative coherence, particularly under continuous video stream input, where temporal misalignment and semantic discontinuity frequently occur. To address this, we propose an adaptive streaming decoding framework with three key innovations: (1) incremental video-language alignment; (2) single-forward-pass-driven temporal decision-making for response generation versus silence; and (3) a peak-end memory compression mechanism integrating streaming KV caching with memory-aware acceleration. The framework enables low-latency online understanding of videos exceeding 10 minutes. Evaluated on three benchmarks, it achieves an average 19.5% improvement in semantic correctness and reduces response temporal deviation by 18.1%. On OmniStar’s five tasks, it boosts FPS by 12.0%, significantly outperforming state-of-the-art methods.
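The "peak-end memory compression" named above echoes the peak-end rule: keep the most salient ("peak") cached frames plus the most recent ("end") ones and evict the rest, so the KV cache stays bounded on 10+ minute streams. A minimal sketch under that reading (the function name, salience scores, and budget sizes are illustrative assumptions, not the paper's implementation):

```python
from typing import List

def peak_end_compress(
    frames: List[str],       # stand-ins for cached frame features / KV entries
    salience: List[float],   # one importance score per frame (assumed given)
    num_peaks: int = 2,      # how many high-salience "peak" frames to keep
    num_end: int = 2,        # how many most-recent "end" frames to keep
) -> List[str]:
    """Keep the top-salience peak frames plus the trailing end frames, in temporal order."""
    end_start = max(0, len(frames) - num_end)
    # Rank only the non-end frames by salience for the peak slots.
    candidates = sorted(range(end_start), key=lambda i: salience[i], reverse=True)
    keep = set(candidates[:num_peaks]) | set(range(end_start, len(frames)))
    return [frames[i] for i in sorted(keep)]

frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
salience = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4]
print(peak_end_compress(frames, salience))  # -> ['f1', 'f3', 'f4', 'f5']
```

The compressed set preserves temporal order, which matters if the surviving entries are re-fed to a streaming KV cache.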

πŸ“ Abstract
Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with a streaming key-value cache to achieve 1.53x faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with an 18.1% reduction in response timing deviation compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
Problem

Research questions and friction points this paper is trying to address.

Enabling real-time video understanding over continuous frame-by-frame input
Determining the optimal moment to respond during a live stream
Keeping inference fast and memory-efficient on long online videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incremental video-language alignment for variable-length video streams
Response-silence decoding for optimal proactive response timing
Memory-aware acceleration with peak-end compression and streaming cache
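The response-silence idea above can be sketched as a per-frame decision made in a single forward pass: the model scores a special "silence" option against its best response and only speaks when responding wins. Everything here (the function, the mock logits, the margin parameter) is an illustrative assumption, not LiveStar's actual decoding code:

```python
from typing import List, Tuple

def stream_decode(
    frame_scores: List[Tuple[float, float]],  # (silence_logit, response_logit) per frame
    margin: float = 0.0,                      # respond only if response beats silence by this margin
) -> List[int]:
    """Return the indices of frames at which the assistant chooses to respond."""
    respond_at = []
    for t, (silence, response) in enumerate(frame_scores):
        if response - silence > margin:       # one comparison per frame = one verification pass
            respond_at.append(t)
    return respond_at

# Mock logits: the model stays silent until something noteworthy appears at t=2.
scores = [(2.0, 0.5), (1.8, 1.0), (0.3, 2.2), (1.5, 1.4)]
print(stream_decode(scores))  # -> [2]
```

Raising `margin` trades responsiveness for fewer spurious interruptions, which is the latency-versus-coherence tension the paper targets.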
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13
Zhenyu Yang · Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Kairui Zhang · ShanghaiTech University
Yuhang Hu · Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Bing Wang · Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Shengsheng Qian · PhD, Institute of Automation, Chinese Academy of Sciences (CASIA) · Data Mining, Multimedia, Social Media
Bin Wen · Kuaishou · MLLM
Fan Yang · Kuaishou Technology
Tingting Gao · Kuaishou Technology
Weiming Dong · Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Changsheng Xu · Professor, Institute of Automation, Chinese Academy of Sciences · Multimedia, Computer Vision