CogStream: Context-guided Streaming Video Question Answering

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large language models face two challenges in streaming video question answering: heavy computational overhead from processing the full historical context, and degraded performance when irrelevant context distracts the model from key details. To address this, we introduce Context-guided Streaming Video Reasoning (CogStream), a task that requires models to identify and use the most relevant historical context when answering questions about the current stream. To support the task, we build a densely annotated streaming video QA dataset with hierarchical question-answer pairs, generated by a semi-automatic pipeline. We also present CogReasoner, a baseline model that combines visual stream compression with historical dialogue retrieval. Experiments are reported to support the effectiveness of this approach.

📝 Abstract
Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Streaming video reasoning relies heavily on contextual information
Existing methods feed all historical context into Vid-LLMs, which is computationally expensive and lets irrelevant context distract from key details
Identifying relevant historical context for accurate video QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual stream compression for efficiency
Historical dialogue retrieval for relevance
Semi-automatic dataset generation pipeline
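
To make the idea of historical dialogue retrieval concrete, here is a minimal sketch of selecting relevant past question-answer pairs for a new question. It is an illustration only, not CogReasoner's method: the page does not describe how the model scores relevance, so a simple bag-of-words cosine similarity is swapped in. The function names, the stop-word list, and the `top_k` and `min_score` parameters are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would likely use learned embeddings.
STOPWORDS = {"a", "the", "is", "what", "where", "why", "did", "on", "up"}


def bag_of_words(text):
    """Lowercase the text, split it into tokens, and count non-stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_history(question, history, top_k=2, min_score=0.1):
    """Return up to top_k past (question, answer) pairs relevant to `question`.

    Pairs scoring below min_score are dropped; the survivors are returned
    in their original chronological order.
    """
    query = bag_of_words(question)
    scored = []
    for index, (past_q, past_a) in enumerate(history):
        score = cosine(query, bag_of_words(past_q + " " + past_a))
        if score >= min_score:
            scored.append((score, index))
    scored.sort(key=lambda item: (-item[0], item[1]))
    keep = sorted(index for _, index in scored[:top_k])
    return [history[i] for i in keep]


history = [
    ("What is the chef holding?", "A red knife."),
    ("What color is the car outside?", "Blue."),
    ("Where did the chef put the knife?", "On the cutting board."),
]
selected = retrieve_history("Why did the chef pick up the knife again?", history)
for past_q, past_a in selected:
    print(past_q, "->", past_a)
```

In this toy dialogue the unrelated question about the car is filtered out, so only the two knife-related exchanges would be passed to the model as context alongside the current stream.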