🤖 AI Summary
Existing large video models face dual challenges in streaming video question answering: prohibitive computational overhead from processing the full historical context, and performance degradation caused by irrelevant information interfering with reasoning over key details. To address this, the paper introduces *Context-Guided Streaming Video Reasoning (CogStream)*, a task that requires models to dynamically identify and selectively use the most relevant historical context when answering questions about the current stream. To support the task, the authors construct a densely annotated streaming video QA dataset with extensive, hierarchical question-answer pairs, generated by a semi-automatic pipeline. They also present CogReasoner, a baseline model that tackles the task efficiently by combining visual stream compression with historical dialogue retrieval. Extensive experiments demonstrate gains in both reasoning accuracy and inference efficiency, validating this context refinement approach.
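The visual stream compression idea can be illustrated with a minimal sketch: merge temporally adjacent frame features that are nearly identical, so redundant stretches of the stream collapse into a single token. This is a simplified stand-in for the paper's learned compression, not its actual method; `compress_stream` and the threshold are hypothetical names chosen for illustration.

```python
import math

def _cos(a, b):
    """Cosine similarity between two feature vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress_stream(frame_feats, sim_thresh=0.9):
    """Merge runs of adjacent frame features whose cosine similarity exceeds
    sim_thresh, averaging each run into one representative feature.
    A toy stand-in for learned visual stream compression."""
    if not frame_feats:
        return []
    compressed, run = [], [frame_feats[0]]
    for feat in frame_feats[1:]:
        if _cos(run[-1], feat) >= sim_thresh:
            run.append(feat)  # still in the same near-static segment
        else:
            compressed.append([sum(c) / len(run) for c in zip(*run)])
            run = [feat]      # scene changed: start a new segment
    compressed.append([sum(c) / len(run) for c in zip(*run)])
    return compressed
```

For example, four frames where the first two and last two are near-duplicates compress to two representative features, shrinking the visual context the model must attend to.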
📝 Abstract
Despite advances in Video Large Language Models (Vid-LLMs) that have improved multimodal understanding, streaming video reasoning remains challenging because of its reliance on contextual information. Existing paradigms feed all available historical context into Vid-LLMs, incurring a significant computational burden for visual data processing; moreover, irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios and requires models to identify the most relevant historical context to deduce answers to questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive, hierarchical question-answer pairs, generated by a semi-automatic pipeline. We also present CogReasoner as a baseline model; it tackles the task efficiently by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method. Code will be released soon.
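The historical dialogue retrieval component can likewise be sketched in a few lines: score each past question-answer turn against the current query and keep only the top matches as context. This toy version uses a bag-of-words embedding in place of a learned text encoder; `embed` and `retrieve_relevant_history` are hypothetical names, not the paper's API.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a learned encoder)."""
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_relevant_history(history, query, top_k=2):
    """Score each past (question, answer) turn against the current query
    and return the top_k most relevant turns as retained context."""
    q_vec = embed(query)
    scored = [(cosine(embed(q + " " + a), q_vec), (q, a)) for q, a in history]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [turn for score, turn in scored[:top_k] if score > 0]
```

Only the retrieved turns are fed back to the model, which bounds the dialogue context regardless of how long the stream runs.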