VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mathematical reasoning in real-world instructional videos poses challenges that go beyond static image or text-only settings: fine-grained visual understanding, recognition of handwritten and digital text, and integration of spoken cues dispersed non-linearly over time. To address this, the paper introduces VideoMathQA, presented as the first multimodal mathematical reasoning benchmark designed specifically for educational videos. It spans 10 mathematical domains, covers videos from 10 seconds to over an hour in length, and frames questions around three reasoning challenges: direct problem solving, conceptual transfer, and deep instructional comprehension. Each question carries fine-grained, multi-step reasoning annotations, and models must jointly ground concepts across visual, audio, and textual modalities. The benchmark was built through expert human annotation totaling over 920 person-hours. Empirical evaluation exposes substantial performance gaps in current vision-language models (VLMs), and both the benchmark and the evaluation code are publicly released.

📝 Abstract
Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
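
As a rough illustration of the fine-grained diagnosis that per-question multi-step reasoning annotations make possible, the sketch below scores both final-answer accuracy and how many annotated reasoning steps a model's explanation covers. It is a minimal sketch under stated assumptions: the field names ("question", "answer", "reasoning_steps"), the prediction format, and the substring-matching heuristic are hypothetical for illustration, not the released VideoMathQA schema or its official evaluation code.

```python
import json
from typing import Callable, Dict, List


def evaluate(items_path: str, predict: Callable[[dict], dict]) -> Dict[str, float]:
    """Score multiple-choice accuracy plus coverage of annotated reasoning steps.

    `predict` is any model wrapper that maps a benchmark item to a dict like
    {"choice": "B", "steps": ["read the slope from the board", ...]}.
    """
    with open(items_path) as f:
        items: List[dict] = json.load(f)

    correct = 0
    step_hits, step_total = 0, 0
    for item in items:
        pred = predict(item)

        # Final-answer accuracy over the multiple-choice options.
        if pred["choice"] == item["answer"]:
            correct += 1

        # Fine-grained diagnosis: check each gold reasoning step against the
        # model's stated steps (naive case-insensitive substring match here).
        for gold_step in item.get("reasoning_steps", []):
            step_total += 1
            if any(gold_step.lower() in s.lower() for s in pred.get("steps", [])):
                step_hits += 1

    return {
        "accuracy": correct / len(items) if items else 0.0,
        "step_coverage": step_hits / step_total if step_total else 0.0,
    }
```

In practice one would replace the substring heuristic with a stricter step-matching criterion (for example, judged equivalence per step), but the overall shape of the loop shows why step-level annotations allow diagnosing where a model's reasoning breaks down rather than only whether its final answer is right.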
Problem

Research questions and friction points this paper is trying to address.

Benchmarking mathematical reasoning in multimodal video contexts
Evaluating cross-modal integration of visual, audio, and textual cues
Assessing multi-step reasoning across diverse mathematical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal video understanding for math reasoning
Temporally extended cross-modal integration
Expert-annotated benchmark with diverse domains