VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video datasets suffer from insufficient modeling of implicit intent and lack fine-grained cross-modal alignment. To address this, VideoMind introduces the first omni-modal video dataset designed for deep cognitive understanding, comprising 103K video samples, each annotated with synchronized audio and a three-tier textual description (factual, abstract, and intentional). It adopts a chain-of-thought-based contextual reasoning paradigm for intent annotation and establishes a manually verified gold-standard benchmark of 3,000 samples. Leveraging large language models, VideoMind provides multi-level semantic annotations (covering entities, scenes, events, actions, and intents) and proposes a hybrid cognitive retrieval evaluation framework. These contributions significantly advance high-level video understanding tasks, including emotion recognition and intent comprehension. The dataset is publicly available on GitHub, Hugging Face, and OpenDataLab.

📝 Abstract
This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio are described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind's key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (CoT) approach, prompting a multimodal large language model (mLLM) through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, https://github.com/cdx-cindy/VideoMind.
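The three-tier description scheme and per-field annotations described above can be sketched as a single sample record. This is a minimal illustration only; the field names and values are assumptions, not VideoMind's actual release schema:

```python
# Hypothetical VideoMind-style sample record (illustrative field names
# and values; not the dataset's actual schema).
sample = {
    "video_id": "example_0001",
    "descriptions": {
        # Three hierarchical layers, progressing from surface to depth.
        "factual": "A man places flowers on a doorstep and walks away.",
        "abstract": "Someone leaves an anonymous gift at a home.",
        "intent": "He wants to cheer up a neighbor without being noticed.",
    },
    "annotations": {
        "subject": "man",
        "place": "doorstep",
        "time": "daytime",
        "event": "leaving a gift",
        "action": "placing flowers",
        "intent": "anonymous kindness",
    },
}

def tiers(record):
    """Return the description tiers present in a record, surface to depth."""
    return list(record["descriptions"])

print(tiers(sample))  # ['factual', 'abstract', 'intent']
```

Note that only the factual tier is directly observable from frames; the abstract and intent tiers require contextual integration across the whole video, which is why the paper generates them via step-by-step CoT prompting.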
Problem

Research questions and friction points this paper is trying to address.

Develops VideoMind dataset for deep video cognition
Provides intent expressions via Chain-of-Thought reasoning
Establishes benchmark for fine-grained cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni-modal dataset with intent grounding
Chain-of-Thought approach for deep cognition
Hybrid-cognitive retrieval experiments for evaluation
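The paper scores its hybrid-cognitive retrieval experiments with multi-level retrieval metrics. One generic way such per-tier scoring could look is Recall@K averaged separately over each description tier; this is a sketch under that assumption, not the paper's exact metric:

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold item appears in the top-k ranked results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def multi_level_recall(results, k=5):
    """Average Recall@K separately for each description tier.

    `results` maps tier name -> list of (ranked_ids, gold_id) per query,
    so surface-level (factual) and deep-cognitive (intent) retrieval
    are scored independently rather than pooled into one number.
    """
    return {
        tier: sum(recall_at_k(r, g, k) for r, g in queries) / len(queries)
        for tier, queries in results.items()
    }

# Toy example: two queries per tier (IDs are placeholders).
results = {
    "factual": [(["a", "b", "c"], "a"), (["x", "y"], "z")],
    "intent":  [(["p", "q"], "q"), (["m", "n"], "m")],
}
print(multi_level_recall(results, k=2))  # {'factual': 0.5, 'intent': 1.0}
```

Reporting per-tier scores is what lets a benchmark separate shallow visual matching from the deep-cognitive comprehension the intent tier targets.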
Baoyao Yang
Guangdong University of Technology
Wanyun Li
Fudan University
Research interests: deep learning, computer vision
Dixin Chen
Guangdong University of Technology, China
Junxiang Chen
WeChat, Tencent, China
Wenbin Yao
WeChat, Tencent, China
Haifeng Lin
Guangdong University of Technology, China