SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video benchmarks primarily target general-scene perception and recognition, failing to assess the advanced multimodal reasoning capabilities of large models in scientific domains. To address this gap, we introduce SciVideoBench, the first reasoning-oriented evaluation benchmark for scientific experiment videos, covering more than 25 disciplines and comprising 1,000 high-difficulty multiple-choice questions that require spatiotemporal localization, domain-specific knowledge integration, and multi-step logical reasoning. The benchmark employs semi-automatic annotation augmented by expert verification to ensure rigor, enabling the first verifiable assessment of high-level cognitive tasks, including causal inference and mechanistic explanation, in scientific videos. Empirical evaluation reveals severe limitations in state-of-the-art multimodal foundation models (e.g., Gemini 2.5 Pro, Qwen2.5-VL), with average accuracy below 35%, exposing systemic deficits in scientific reasoning. This work establishes a novel evaluation paradigm for multimodal AI in scientific research and provides concrete directions for improvement.
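The headline metric is plain multiple-choice accuracy over the 1,000 questions. Below is a minimal sketch, in Python, of how such a score could be computed; the BenchmarkItem schema, the A-D option letters, and the extract_choice heuristic are illustrative assumptions, not the paper's released data format.

import re
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One multiple-choice question tied to an experiment video (hypothetical schema)."""
    video_id: str
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str              # gold option letter

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter out of a model's free-form reply."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def accuracy(items: list[BenchmarkItem], responses: list[str]) -> float:
    """Fraction of items whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(r) == item.answer for item, r in zip(items, responses))
    return correct / len(items)

One consequence worth noting: if the questions are four-way (an assumption here), random guessing sits at 25%, so average accuracies below 35% would be only modestly above chance.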

📝 Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects, verified through a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will interest the community and help push the boundaries of cutting-edge AI for broader science.
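To make the evaluation protocol concrete, here is a hedged sketch of a harness loop, reusing the BenchmarkItem and accuracy helpers from the sketch above. The LMMFn adapter type, format_prompt, and evaluate are hypothetical names introduced only for illustration; each actual model (Gemini 2.5 Pro, Qwen2.5-VL, etc.) would be wrapped behind the adapter by whatever client library it ships with.

from typing import Callable

# Hypothetical adapter: takes video frame paths plus a text prompt and returns
# the model's free-form reply. Proprietary and open-source LMMs alike can be
# wrapped behind this one signature.
LMMFn = Callable[[list[str], str], str]

def format_prompt(item: BenchmarkItem) -> str:
    """Render a question and its options into one instruction prompt."""
    options = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
    return (
        "Watch the experiment video, then answer the question.\n\n"
        f"{item.question}\n{options}\n\n"
        "Reply with only the letter of the correct option."
    )

def evaluate(model: LMMFn, items: list[BenchmarkItem],
             frames: dict[str, list[str]]) -> float:
    """Score one model over the full benchmark and return its accuracy."""
    responses = [model(frames[item.video_id], format_prompt(item)) for item in items]
    return accuracy(items, responses)

Keeping the model behind a single adapter function is what lets one harness compare proprietary APIs and locally hosted open-source models on identical prompts.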
Problem

Research questions and friction points this paper is trying to address.

Benchmarking scientific video reasoning in multimodal models
Addressing complex reasoning gaps in current video benchmarks
Evaluating domain-specific knowledge and spatiotemporal perception in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SciVideoBench for scientific video reasoning
Uses 1,000 expert-verified multiple-choice science questions
Evaluates spatiotemporal perception and logical reasoning abilities
👥 Authors
Andong Deng
University of Central Florida
Taojiannan Yang
University of North Carolina at Chapel Hill
Shoubin Yu
PhD Candidate at UNC Chapel Hill
Multimodal AI · Machine Learning · Computer Vision · Video Understanding
Lincoln Spencer
University of Central Florida
Mohit Bansal
Parker Distinguished Professor, Computer Science, UNC Chapel Hill
Natural Language Processing · Computer Vision · Machine Learning · Multimodal AI
Chen Chen
University of Central Florida
S. Yeung-Levy
Stanford University
Xiaohan Wang
Stanford University