LION-FS: Fast&Slow Video-Language Thinker as Online Video Assistant

📅 2025-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing online video assistants struggle to balance real-time responsiveness and semantic fidelity under low-frame-rate conditions, resulting in coarse visual representations, delayed responses, inaccurate temporal localization, and superficial contextual understanding. To address this, we propose a real-time online video assistant tailored for first-person videos, introducing the novel Fast&Slow dual-path architecture: the Fast path enables millisecond-level frame-wise decisions via token aggregation and pruning routing; the Slow path integrates multi-granularity keyframe enhancement with a multimodal Thinking Template to produce fine-grained, context-aware responses. Additionally, our method incorporates human–environment interaction modeling and multi-granularity spatial pooling. Evaluated across multiple online video understanding tasks, our approach achieves state-of-the-art performance in both effectiveness and efficiency—reducing response latency by 37%, improving action localization accuracy by 12.6% mAP, and enhancing contextual reasoning depth.

Technology Category

Application Category

📝 Abstract
First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features.To overcome the trade-off between efficacy and efficiency, we propose"Fast&Slow Video-Language Thinker"as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1)Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2)Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
Problem

Research questions and friction points this paper is trying to address.

Overcomes trade-off between efficacy and efficiency in video assistants
Enhances real-time, proactive, and contextually precise video responses
Optimizes keyframes and features for detailed, accurate video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage optimization strategy for video processing
Token Aggregation Routing for efficient feature fusion
Multi-granularity Keyframe Augmentation for detailed responses
🔎 Similar Papers
2024-06-09Annual Meeting of the Association for Computational LinguisticsCitations: 13
W
Wei Li
Harbin Institute of Technology, Shenzhen
Bing Hu
Bing Hu
Unknown affiliation
Machine LearningData MiningStatistics
Rui Shao
Rui Shao
Professor, Harbin Institute of Technology (Shenzhen)
Computer VisionMultimodal LLMEmbodied AI
L
Leyang Shen
Harbin Institute of Technology, Shenzhen
L
Liqiang Nie
Harbin Institute of Technology, Shenzhen