AURA: Always-On Understanding and Real-Time Assistance via Video Streams

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing video large language models, which are predominantly confined to offline processing and struggle to support open-domain, long-duration interaction over real-time video streams. To overcome this, the authors propose an end-to-end streaming visual interaction framework that unifies continuous video understanding and real-time response generation within a single architecture, moving beyond conventional decoupled trigger-response pipelines. The approach integrates context management, tailored data construction, specialized training objectives, and deployment optimizations, and incorporates ASR and TTS modules to enable real-time inference. Evaluated on a streaming video understanding benchmark, the method achieves state-of-the-art performance and demonstrates a real-time system running at 2 FPS on dual 80GB GPUs.
📝 Abstract
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
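The abstract describes a unified loop in which the model continuously ingests frames into a bounded long-horizon context and may respond at any step, either to a pending user query or proactively, rather than waiting for an external trigger. A minimal toy sketch of that control flow is shown below; all class and method names, and the salience-based trigger rule, are illustrative assumptions, not AURA's actual implementation.

```python
from collections import deque

# Hypothetical sketch of an always-on streaming loop: frames are ingested
# continuously into a bounded context window, and a single step() call may
# answer a pending query or respond proactively. Names and the trigger rule
# are illustrative only, not AURA's actual design.

class StreamingAssistant:
    def __init__(self, context_size=8):
        self.context = deque(maxlen=context_size)  # bounded long-horizon context
        self.pending_query = None

    def ingest_frame(self, frame):
        self.context.append(frame)  # old frames are evicted automatically

    def ask(self, query):
        self.pending_query = query  # user question arrives mid-stream

    def step(self):
        """One unified decode step: answer a pending query if there is one,
        otherwise respond proactively when the newest frame looks salient."""
        if self.pending_query is not None:
            answer = f"answer({self.pending_query}, frames={len(self.context)})"
            self.pending_query = None
            return answer
        if self.context and self.context[-1].get("salient"):
            return f"alert(frame={self.context[-1]['id']})"
        return None  # stay silent, keep observing

assistant = StreamingAssistant(context_size=4)
outputs = []
for i in range(6):
    assistant.ingest_frame({"id": i, "salient": i == 3})
    if i == 1:
        assistant.ask("what happened?")
    out = assistant.step()
    if out is not None:
        outputs.append(out)

print(outputs)  # ['answer(what happened?, frames=2)', 'alert(frame=3)']
```

The point of the sketch is that observation and response generation share one loop, so a proactive alert and a query answer come from the same step function instead of a separate trigger pipeline.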
Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models
real-time video streams
open-ended question answering
long-horizon interaction
streaming understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming VideoLLM
Real-Time Interaction
Always-On Understanding
End-to-End Framework
Long-Horizon Context
Xudong Lu
PhD student, the Chinese University of Hong Kong
Computer Vision · Machine Learning
Yang Bo
Huawei Research
Jinpeng Chen
City University of Hong Kong
Continual Learning · Multimodal Large Language Model
Shuhan Li
Huawei Research
Xintong Guo
Huawei Research
Huankang Guan
Huawei Research
Fang Liu
Huawei Research
Dunyuan Xu
Huawei Research
Peiwen Sun
Multimedia Lab, The Chinese University of Hong Kong
Multimodal Learning
Heyang Sun
Huawei Research
Rui Liu
Huawei Research
Hongsheng Li
CUHK MMLab