🤖 AI Summary
Evaluating large multimodal models' (LMMs) capacity to acquire and apply deep knowledge from domain-specific videos remains an open challenge. Method: We introduce Video-MMMU, the first multi-disciplinary video knowledge-acquisition benchmark, comprising 300 expert-curated videos and 900 human-annotated questions, structured along a cognitive progression: Perception → Comprehension → Adaptation. We propose a stage-aligned evaluation framework and Δknowledge, a novel metric quantifying knowledge gain after video viewing. Contribution/Results: Experiments reveal that state-of-the-art LMMs suffer over 40% accuracy degradation from the Perception to the Adaptation stage, and the human-model performance gap reaches 58.7%. Video-MMMU uncovers critical bottlenecks in video-driven higher-order cognition and establishes a new paradigm and standardized toolset for evaluating LMMs' video understanding capabilities.
📝 Abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge-gain metric, Δknowledge, quantifies the improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' ability to learn and adapt from videos.
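To make the metric concrete, below is a minimal sketch of how a knowledge-gain score in the spirit of Δknowledge could be computed. The specific normalization used here (raw accuracy gain divided by the remaining headroom above the pre-viewing score) is an assumption for illustration, not necessarily the paper's exact definition, and the function name is hypothetical.

```python
# Minimal sketch of a normalized knowledge-gain metric in the spirit of
# Δknowledge. Assumption: gain is normalized by the headroom left above the
# model's pre-viewing accuracy, so a model near ceiling is not penalized.

def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain after video viewing, in percent.

    acc_before: accuracy (0-100) on stage-aligned questions without the video.
    acc_after:  accuracy (0-100) on the same questions after watching it.
    """
    headroom = 100.0 - acc_before  # room left to improve
    if headroom <= 0:
        return 0.0  # already at ceiling; no measurable gain
    return (acc_after - acc_before) / headroom * 100.0


if __name__ == "__main__":
    # Example: a model improves from 45% to 56% accuracy after viewing,
    # i.e. it closes 11 of the 55 remaining points -> 20% normalized gain.
    print(f"{delta_knowledge(45.0, 56.0):.1f}%")  # -> 20.0%
```

Under this formulation, a negative score is also meaningful: it indicates the model performed worse after viewing, suggesting the video interfered with rather than improved its answers.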