AI Summary
Existing online Video-LLMs struggle to simultaneously achieve low-latency response and narrative coherence, particularly under continuous video stream input, where temporal misalignment and semantic discontinuity frequently occur. To address this, we propose an adaptive streaming decoding framework with three key innovations: (1) incremental video-language alignment; (2) single-forward-pass decisions on when to respond versus remain silent; and (3) a peak-end memory compression mechanism combined with streaming KV caching for memory-aware acceleration. The framework enables low-latency online understanding of videos exceeding 10 minutes. Evaluated on three benchmarks, it achieves an average 19.5% improvement in semantic correctness and reduces response timing deviation by 18.1%. On OmniStar's five tasks, it improves FPS by 12.0%, significantly outperforming state-of-the-art methods.
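To make the response-silence mechanism concrete, below is a minimal per-frame decision loop. It is a sketch under assumptions, not the released implementation: `silence_probability`, `stream_step`, and the novelty-based scorer are hypothetical stand-ins for the single forward pass that, in the real system, appends the new frame's tokens to the streaming KV cache and reads the probability of a dedicated silence token.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for one forward pass of the Video-LLM: in the real system
# this would read the probability of a <silence> token from the streaming KV cache;
# here a toy frame-novelty score keeps the loop executable.
def silence_probability(frame: torch.Tensor, last_frame: torch.Tensor | None) -> float:
    if last_frame is None:
        return 0.0  # first frame: always worth describing
    novelty = 1.0 - F.cosine_similarity(frame, last_frame, dim=0).item()
    return float(max(0.0, 1.0 - novelty))  # little visual change -> likely silence

def stream_step(frame: torch.Tensor, state: dict, threshold: float = 0.5) -> str:
    """Decide 'respond' vs. 'silent' for one incoming frame with a single scoring call."""
    p_sil = silence_probability(frame, state.get("last_frame"))
    state["last_frame"] = frame  # frame features always enter the streaming state
    return "silent" if p_sil >= threshold else "respond"

if __name__ == "__main__":
    state, frame = {}, torch.randn(16)
    for t in range(6):
        frame = -frame if t == 3 else frame + 0.01 * torch.randn(16)  # t=3: abrupt scene change
        print(t, stream_step(frame, state))
```

In this toy run the assistant responds at t=0 and again at the simulated scene change (t=3), and stays silent while consecutive frames are nearly identical, which is the behavior the single-forward-pass verification is meant to produce.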
Abstract
Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via single-forward-pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with a streaming key-value cache to achieve 1.53x faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness and an 18.1% reduction in timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
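As a rough illustration of the peak-end memory compression idea, the sketch below bounds a streaming frame memory by keeping the most salient frames ("peaks") plus the most recent ones ("end") and evicting the rest once a budget is exceeded. All names (`PeakEndMemory`, `FrameEntry`, the budget values) are hypothetical assumptions; the actual mechanism compresses cached key-value tensors rather than this simplified frame list.

```python
import heapq
import random
from dataclasses import dataclass, field

@dataclass
class FrameEntry:
    t: int            # frame index in the stream
    salience: float   # e.g. a change/novelty score from the vision encoder (assumed)
    kv: object        # placeholder for this frame's cached key/value tensors

@dataclass
class PeakEndMemory:
    budget: int = 64                              # max frames kept in memory
    n_end: int = 16                               # most recent frames always preserved
    frames: list = field(default_factory=list)

    def add(self, entry: FrameEntry) -> None:
        self.frames.append(entry)
        if len(self.frames) > self.budget:
            self._compress()

    def _compress(self) -> None:
        end = self.frames[-self.n_end:]                            # the "end" segment
        middle = self.frames[:-self.n_end]
        peaks = heapq.nlargest(self.budget - self.n_end,           # the "peaks"
                               middle, key=lambda e: e.salience)
        self.frames = sorted(peaks + end, key=lambda e: e.t)       # restore temporal order

if __name__ == "__main__":
    mem = PeakEndMemory(budget=8, n_end=3)
    for t in range(30):
        mem.add(FrameEntry(t=t, salience=random.random(), kv=None))
    print([e.t for e in mem.frames])  # a few high-salience earlier frames + the last 3
```

Keeping the memory at a fixed budget is what lets the streaming KV cache stay bounded over 10+ minute videos; the "end" slots preserve recency for timely responses while the "peak" slots retain the most informative earlier moments for narrative coherence.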