Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Proact-VL, the first proactive multimodal language model framework designed for continuous video streams, with the goal of enabling low-latency, content-controllable real-time AI interaction. By integrating streaming video understanding, a response-triggering mechanism, and generation-control strategies, Proact-VL gives AI agents environmental awareness and autonomous decision-making, enabling natural, efficient real-time interaction in scenarios such as game commentary and guidance. Experiments on the newly curated Live Gaming Benchmark show that Proact-VL outperforms existing methods in response latency, generation quality, and video comprehension, validating its effectiveness and practicality for real-time interactive applications.

📝 Abstract
Proactive, real-time interactive experiences are essential for human-like AI companions, yet they face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming roles, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios (solo commentary, co-commentary, and user guidance), and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
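The three challenges named in the abstract can be pictured as a single loop: consume a continuous stream, autonomously decide when to fire a response, and cap generation to stay within the latency budget. The sketch below is purely illustrative; all names (`trigger_score`, `RESPOND_THRESHOLD`, `MAX_CHARS`, the toy surprise heuristic) are assumptions for this page, not components of the Proact-VL system itself.

```python
# Illustrative sketch of a proactive streaming loop (NOT the Proact-VL
# architecture): (1) ingest a continuous stream, (2) decide when to
# respond, (3) bound output length for real-time constraints.
from collections import deque

RESPOND_THRESHOLD = 0.5   # assumed: trigger fires above this score
MAX_CHARS = 16            # assumed: generation budget per response

def trigger_score(history):
    """Toy stand-in for a learned trigger head: respond when the
    latest frame feature deviates sharply from the recent average."""
    if len(history) < 2:
        return 0.0
    recent = list(history)
    avg = sum(recent[:-1]) / (len(recent) - 1)
    return abs(recent[-1] - avg)

def generate_response(frame, budget=MAX_CHARS):
    """Toy stand-in for length-controlled generation."""
    return f"event@{frame}"[:budget]

def proactive_loop(frames, window=4):
    """Scan a frame-feature stream, emitting (time, response) pairs
    only when the trigger decides the agent should speak."""
    history = deque(maxlen=window)
    responses = []
    for t, frame in enumerate(frames):
        history.append(frame)
        if trigger_score(history) > RESPOND_THRESHOLD:
            responses.append((t, generate_response(frame)))
    return responses

# A flat signal stays silent; a spike triggers one proactive response.
print(proactive_loop([0.1, 0.1, 0.1, 0.9, 0.1]))  # → [(3, 'event@0.9')]
```

The point of the sketch is the control flow, not the heuristic: a proactive agent must own the decision of *when* to speak, rather than waiting for a user query, while keeping each utterance short enough to remain real-time.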
Problem

Research questions and friction points this paper addresses.

real-time AI companions
low-latency inference
proactive response
multimodal interaction
video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive VideoLLM
Real-Time Interaction
Low-Latency Inference
Multimodal Language Model
Live Gaming Benchmark