VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high frame redundancy and challenges in modeling temporal coherence for streaming long-video understanding, this paper proposes an elastic-scale event modeling framework. Methodologically, it introduces (1) prediction-guided elastic-scale event segmentation (EES), the first of its kind, enabling duration-adaptive and semantics-driven dynamic video partitioning; and (2) a hierarchical event consolidation (HEC) module that constructs multi-granular visual representations—from frame-level to event-level—via semantic clustering and streaming incremental updates. The framework is fully compatible with image-based multimodal large language models (MLLMs) and supports zero-modification, plug-and-play integration. Evaluated on both offline and streaming long-video understanding benchmarks, it achieves state-of-the-art performance, significantly improving both accuracy and efficiency in temporal reasoning over extended video sequences.

📝 Abstract
Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.
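The abstract describes two cooperating steps: elastic segmentation that opens a new event when incoming frames drift semantically, and consolidation that merges adjacent events into coarser abstractions. The authors' actual implementation is in the linked repository; purely as intuition, here is a toy sketch over frame embeddings (function names, thresholds, and the greedy merging rule are all hypothetical, not the paper's method):

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_stream(frames, tau=0.8):
    """Greedy streaming segmentation: start a new event when an incoming
    frame embedding drifts too far from the current event's running mean.
    (Toy stand-in for prediction-guided elastic-scale segmentation.)"""
    events = []
    for f in frames:
        if events and cos(f, np.mean(events[-1], axis=0)) >= tau:
            events[-1].append(f)      # frame still belongs to current event
        else:
            events.append([f])        # semantic drift: open a new event
    return [np.stack(e) for e in events]

def consolidate(events, tau=0.9):
    """Merge adjacent events whose mean embeddings are similar,
    producing one coarser level of the event hierarchy."""
    merged = [events[0]]
    for e in events[1:]:
        if cos(merged[-1].mean(axis=0), e.mean(axis=0)) >= tau:
            merged[-1] = np.concatenate([merged[-1], e])
        else:
            merged.append(e)
    return merged
```

Applying `consolidate` repeatedly would yield progressively coarser levels, which is the general shape of a frame-to-event hierarchy; the paper's version is prediction-guided and duration-adaptive rather than fixed-threshold.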
Problem

Research questions and friction points this paper is trying to address.

Heavy redundancy across frames in long videos, coupled with the need for temporally coherent representations
Static strategies (sparse sampling, frame compression, clustering) are tuned for offline settings and yield fragmented or over-compressed outputs on continuous streams
Fixed event granularity cannot adapt to video duration or evolving semantics during streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic representation framework for streaming video understanding
Elastic-Scale Event Segmentation refines event boundaries adaptively
Hierarchical Event Consolidation aggregates segments into multi-level abstractions
Naishan Zheng
University of Science and Technology of China
Jie Huang
University of Science and Technology of China
Qingpei Guo
Ant Group
Feng Zhao
University of Science and Technology of China

Multimodal LLMs · Vision-Language Models