Learning Streaming Video Representation via Multitask Training

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of frame-by-frame processing, historical-context preservation, and low-latency decision-making in streaming video understanding, this paper proposes StreamFormer, a streaming visual backbone built on causal temporal attention. Methodologically, it incorporates a causal temporal attention mechanism into a pre-trained ViT to enforce unidirectional temporal dependencies, and establishes a multitask vision-language alignment training framework that jointly learns global semantics, temporal dynamics, and fine-grained spatial relationships. Evaluated on online action detection, online video instance segmentation, and video question answering, StreamFormer achieves competitive accuracy while maintaining efficiency, demonstrating its potential for real-time applications such as embodied AI and autonomous driving.

📝 Abstract
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
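The key architectural idea, causal temporal attention, can be illustrated with a minimal single-head sketch: each frame's features attend only to the current and earlier frames via a lower-triangular mask. This is an assumption-laden toy version for illustration, not StreamFormer's actual implementation (which operates inside a pre-trained ViT with learned projections and multiple heads).

```python
import numpy as np

def causal_temporal_attention(x):
    """Single-head scaled dot-product attention over the time axis
    with a causal (lower-triangular) mask, so each frame attends only
    to itself and past frames. Illustrative sketch, not the paper's code.

    x: (T, d) array of per-frame features; returns (T, d) outputs.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                  # (T, T) frame-pair similarities
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                       # mask out future frames
    # row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                             # causal mixture of past frames
```

Because of the mask, the output for frame t never depends on frames after t, which is what permits frame-by-frame streaming inference with preserved history.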
Problem

Research questions and friction points this paper is trying to address.

Develop efficient streaming video processing for real-time applications
Unify diverse video tasks via multitask visual-language alignment
Achieve low-latency decisions while preserving historical information
Innovation

Methods, ideas, or system contributions that make the work stand out.

StreamFormer with causal temporal attention
Multitask visual-language alignment framework
Efficient real-time video processing
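The multitask visual-language alignment objective is not spelled out on this page; a common building block for such alignment is a symmetric InfoNCE (CLIP-style) contrastive loss between matched video and text embeddings. The sketch below is a generic stand-in under that assumption, not the paper's actual loss.

```python
import numpy as np

def symmetric_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (video, text)
    embedding pairs: row i of video_emb corresponds to row i of text_emb.
    Generic visual-language alignment sketch; the paper's exact multitask
    losses are assumptions here, not reproduced.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(v))              # diagonal = positive pairs

    def ce(l):                              # row-wise cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average video-to-text and text-to-video directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Pulling matched pairs together on the diagonal while pushing mismatched pairs apart is what lets one backbone serve global-semantic, temporal, and spatial tasks through task-specific text prompts.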