ViSpeak: Visual Instruction Feedback in Streaming Videos

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the real-time, multimodal, and interactive challenges inherent in streaming video, this work introduces “Visual Instruction Feedback” — a novel task requiring models to perceive and execute semantic instructions in real time from dynamic visual signals (e.g., gestures, facial expressions) to enable natural human–machine interaction. Methodologically, we construct ViSpeak-Instruct, the first dedicated dataset for this task, and ViSpeak-Bench, a comprehensive evaluation benchmark; we further propose a streaming-capable large multimodal model (LMM) architecture integrating temporal visual encoding, instruction-aware alignment, and low-latency inference optimization. Experiments demonstrate that ViSpeak achieves state-of-the-art performance across multiple streaming video understanding benchmarks. After fine-tuning, it attains 92.3% accuracy on gesture-triggered dialogue—a subtask—matching the streaming multimodal understanding capability of GPT-4o.

Technology Category

Application Category

📝 Abstract
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.
Problem

Research questions and friction points this paper is trying to address.

Extend streaming video understanding with visual instruction feedback.
Enhance user-agent interaction through visual content awareness.
Develop ViSpeak model for real-time video instruction recognition.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Instruction Feedback in streaming videos
ViSpeak model with GPT-4o-level performance
ViSpeak-Instruct dataset for training and evaluation
🔎 Similar Papers
No similar papers found.