🤖 AI Summary
This work addresses the limitations of existing video LLMs in streaming scenarios: because they are developed and evaluated under offline paradigms, they struggle to respond at the moment sufficient visual evidence first appears, offer little decision transparency, and rely on inefficient memory mechanisms. To overcome these challenges, the authors propose a framework that decouples reasoning control from memory integration. It introduces an Active Thinking Decision Maker (ATDM), which uses observable progress and confidence metrics to time responses transparently, and a Hierarchical Progressive Semantic Integration (HPSI) module, which fuses cross-clip semantics through learnable multi-level aggregation tokens while staying within the token budget. On StreamingBench the method reaches 71.60% accuracy, up from the previous best of 67.63%, and it attains 46.9% on OVOBench. The authors present these results as evidence of better evidence alignment and decision transparency in streaming video understanding.
📝 Abstract
Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce \textbf{Thinking-QwenVL}, an instantiation of this framework with two core components. First, the \emph{Active Thinking Decision Maker (ATDM)} is a transparent reasoning controller that externalizes its decision process through observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the \emph{Hierarchical Progressive Semantic Integration (HPSI)} module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI; for example, Thinking-QwenVL raises accuracy on the StreamingBench benchmark from the previous state of the art of 67.63\% to 71.60\%.
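The response-timing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how $\boldsymbol{\rho}$ and $\boldsymbol{c}$ are computed or combined, so the thresholds `rho_min` and `c_min`, the rule "answer once both are cleared", the `score` callback, and the toy scorer are all hypothetical. The `memory` value only stands in for the global state that an HPSI-like module would carry from clip to clip.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Decision:
    t_r: Optional[int]                     # clip index at which an answer was emitted (None = never)
    trace: List[Tuple[int, float, float]]  # (clip index, rho, c) logged at every step


def stream_decide(
    clips: Iterable[object],
    score: Callable[[object, object], Tuple[float, float, object]],
    rho_min: float = 0.8,  # assumed progress threshold, not from the paper
    c_min: float = 0.9,    # assumed confidence threshold, not from the paper
) -> Decision:
    """Answer at the first clip where progress and confidence both clear their thresholds."""
    memory = None  # placeholder for the state propagated across clips
    trace: List[Tuple[int, float, float]] = []
    for t, clip in enumerate(clips):
        # `score` is a hypothetical model call returning (rho, c, updated memory).
        rho, c, memory = score(clip, memory)
        trace.append((t, rho, c))  # the logged trace is what makes the decision inspectable
        if rho >= rho_min and c >= c_min:
            return Decision(t_r=t, trace=trace)
    return Decision(t_r=None, trace=trace)


def toy_score(clip, memory):
    """Toy scorer: each clip contributes `clip` units of evidence; 4 units are sufficient."""
    seen = (memory or 0) + clip
    level = min(1.0, seen / 4)
    return level, level, seen


decision = stream_decide([1, 1, 1, 1, 1], toy_score)
print(decision.t_r)
```

In this toy run the accumulated evidence first clears both thresholds at the fourth clip (index 3), so the controller answers there rather than waiting for the end of the stream. That clip index plays the role of $t_r$, and the ideal behaviour is for it to coincide with $t^\star$.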