Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos

📅 2025-06-11
🤖 AI Summary
This work addresses the challenges of temporal visual understanding, external knowledge integration, and contextual consistency in video-based multi-turn dialogue. To this end, we introduce the first external-knowledge-enhanced video multi-turn dialogue task and present OKCV, a benchmark dataset comprising 2,017 videos and 5,986 human-annotated dialogues (40,954 turns), with each turn aligned to a video segment, the dialogue history, and an external knowledge triple. We formally define and implement a joint modeling paradigm that is video-driven, knowledge-dependent, and context-aware, moving beyond the closed-visual-world assumption of conventional VQA and video dialogue systems. Our strong baseline integrates multimodal alignment, temporal attention, and knowledge retrieval. Experiments reveal critical bottlenecks in existing methods regarding cross-modal knowledge fusion and long-horizon dialogue coherence. OKCV establishes a rigorous, standardized benchmark and a comprehensive evaluation framework for advancing knowledge-augmented video dialogue research.

📝 Abstract
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual content. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
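Taking the abstract's description literally, one way to picture a single dataset example (a video plus a multi-turn dialogue, with each turn grounded in a segment and an outside fact) is the hypothetical record below. The field names and the sample turn are illustrative assumptions, not the actual OKCV schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DialogueTurn:
    question: str
    answer: str
    video_segment: Tuple[float, float]  # (start_sec, end_sec) the turn is grounded in
    external_fact: str                  # outside knowledge the answer requires

@dataclass
class OKCVExample:
    video_id: str
    turns: List[DialogueTurn]           # ordered, interleaved dialogue turns

# Illustrative example (invented, not taken from the dataset).
example = OKCVExample(
    video_id="v_0001",
    turns=[
        DialogueTurn(
            question="What breed is the dog fetching the ball?",
            answer="A border collie, a breed developed for herding sheep.",
            video_segment=(3.0, 8.5),
            external_fact="Border collies were bred for herding livestock.",
        )
    ],
)
```

The key property this structure captures is that the answer cannot be produced from the segment alone: `external_fact` carries information that is never visible in the video.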
Problem

Research questions and friction points this paper is trying to address.

Extends OK-VQA to video-based dialogues requiring external knowledge
Identifies relevant video segments and integrates non-visual information
Considers conversation context for coherent multi-turn dialogue responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages external knowledge for video dialogues
Identifies relevant visual details over time
Considers conversation context for responses
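The three points above suggest a simple pipeline per dialogue turn: ground the question in a video segment, retrieve outside knowledge, then generate a context-aware reply. A minimal sketch follows, with placeholder callables standing in for the paper's actual components (which are not specified here):

```python
def answer_turn(video_segments, dialogue_history, question,
                select_segment, retrieve_knowledge, generate):
    """Sketch of a retrieval-augmented video-dialogue turn.

    `select_segment`, `retrieve_knowledge`, and `generate` are
    hypothetical callables, not components from the paper.
    """
    segment = select_segment(video_segments, question)           # temporal grounding
    facts = retrieve_knowledge(question, dialogue_history)       # outside knowledge
    return generate(segment, facts, dialogue_history, question)  # context-aware answer

# Toy stand-ins to show the data flow.
segments = [("0-5s", "dog fetches ball"), ("5-10s", "dog drinks water")]
reply = answer_turn(
    segments,
    dialogue_history=["Q: What animal is shown? A: A dog."],
    question="What breed is it?",
    select_segment=lambda segs, q: segs[0],
    retrieve_knowledge=lambda q, h: ["Border collies are a herding breed."],
    generate=lambda seg, facts, hist, q: f"Based on {seg[0]}: {facts[0]}",
)
print(reply)  # Based on 0-5s: Border collies are a herding breed.
```

The dialogue history is threaded through both retrieval and generation, reflecting the dataset's emphasis on multi-turn coherence rather than isolated question answering.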
Benjamin Reichman
Georgia Institute of Technology
Artificial Intelligence · Machine Learning

Constantin Patsch
Technical University of Munich

Jack Truxal
Georgia Institute of Technology

Atishay Jain
Georgia Institute of Technology

Larry Heck
Professor, Georgia Institute of Technology
conversational AI · dialogue systems · multimodal LLMs · speech technology