🤖 AI Summary
Existing large video models face dual challenges in streaming video question answering: prohibitive computational overhead from processing the full historical context, and performance degradation caused by irrelevant information interfering with reasoning over key details. To address this, the paper introduces *Context-Guided Streaming Video Reasoning (CogStream)*, a task that requires models to dynamically identify and selectively use the most relevant historical context when answering questions about the current stream. To support the task, the authors construct a densely annotated streaming video QA dataset with extensive, hierarchical question-answer pairs, generated by a semi-automatic pipeline. They also present CogReasoner, a baseline model that tackles the task efficiently by combining visual stream compression with historical dialogue retrieval. Extensive experiments demonstrate gains in both reasoning accuracy and inference efficiency, validating this context refinement approach.
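The visual stream compression idea can be illustrated with a minimal sketch: merge temporally adjacent frame features that are nearly identical, so redundant stretches of the stream collapse into a single token. This is a simplified stand-in for the paper's learned compression, not its actual method; `compress_stream` and the threshold are hypothetical names chosen for illustration.

```python
import math

def _cos(a, b):
    """Cosine similarity between two feature vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress_stream(frame_feats, sim_thresh=0.9):
    """Merge runs of adjacent frame features whose cosine similarity exceeds
    sim_thresh, averaging each run into one representative feature.
    A toy stand-in for learned visual stream compression."""
    if not frame_feats:
        return []
    compressed, run = [], [frame_feats[0]]
    for feat in frame_feats[1:]:
        if _cos(run[-1], feat) >= sim_thresh:
            run.append(feat)  # still in the same near-static segment
        else:
            compressed.append([sum(c) / len(run) for c in zip(*run)])
            run = [feat]      # scene changed: start a new segment
    compressed.append([sum(c) / len(run) for c in zip(*run)])
    return compressed
```

For example, four frames where the first two and last two are near-duplicates compress to two representative features, shrinking the visual context the model must attend to.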
📝 Abstract
Despite advances in Video Large Language Models (Vid-LLMs) that have improved multimodal understanding, streaming video reasoning remains challenging because of its reliance on contextual information. Existing paradigms feed all available historical context into Vid-LLMs, incurring a significant computational burden for visual data processing; moreover, irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios and requires models to identify the most relevant historical context to deduce answers to questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive, hierarchical question-answer pairs, generated by a semi-automatic pipeline. We also present CogReasoner as a baseline model; it tackles the task efficiently by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method. Code will be released soon.
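The historical dialogue retrieval component can likewise be sketched in a few lines: score each past question-answer turn against the current query and keep only the top matches as context. This toy version uses a bag-of-words embedding in place of a learned text encoder; `embed` and `retrieve_relevant_history` are hypothetical names, not the paper's API.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a learned encoder)."""
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_relevant_history(history, query, top_k=2):
    """Score each past (question, answer) turn against the current query
    and return the top_k most relevant turns as retained context."""
    q_vec = embed(query)
    scored = [(cosine(embed(q + " " + a), q_vec), (q, a)) for q, a in history]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [turn for score, turn in scored[:top_k] if score > 0]
```

Only the retrieved turns are fed back to the model, which bounds the dialogue context regardless of how long the stream runs.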