MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reasoning benchmarks suffer from three key limitations: reliance on static images, narrow task coverage (predominantly mathematical), and rapid performance saturation. To address these, we introduce MORSE-500, a programmatically controllable video benchmark of 500 script-generated dynamic clips spanning six reasoning domains: abstraction, physics, planning, spatial reasoning, temporal reasoning, and mathematics. Each instance is produced by a script-driven, controllable generation pipeline (deterministic rendering via Manim, Matplotlib, and MoviePy, plus generative video models and curated real footage), enabling precise modulation of visual complexity, distractor density, and temporal dynamics, and allowing arbitrarily harder instances to be generated as models improve. We release the full dataset, generation scripts, and evaluation harness. Evaluation of state-of-the-art models, including Gemini 2.5 Pro and OpenAI o3, reveals substantial gaps: accuracy on abstraction and planning tasks remains below 40%, exposing fundamental weaknesses in current multimodal reasoning capabilities.
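
As a concrete illustration of the script-driven paradigm described above, the following is a minimal sketch of a deterministic clip generator, not taken from the MORSE-500 release: a seeded Python script renders a short video whose difficulty is set by explicit parameters such as distractor count and object speed. It assumes MoviePy 1.x (the `moviepy.editor` module) and NumPy; all names are illustrative.

```python
# Minimal sketch of a script-generated clip (hypothetical, not the authors' code).
# Difficulty knobs are explicit function arguments; a fixed seed makes rendering deterministic.
import numpy as np
from moviepy.editor import VideoClip  # MoviePy 1.x import path

def make_clip(seed=0, n_distractors=5, speed=40, res=256, duration=4.0):
    rng = np.random.default_rng(seed)  # fixed seed -> identical video on every run
    distractors = rng.integers(0, res - 16, size=(n_distractors, 2))

    def make_frame(t):
        frame = np.zeros((res, res, 3), dtype=np.uint8)
        # Static grey distractor squares (difficulty knob: n_distractors)
        for x, y in distractors:
            frame[y:y + 16, x:x + 16] = 90
        # The queried object: a white square moving left to right (difficulty knob: speed)
        x = int(speed * t) % (res - 16)
        frame[120:136, x:x + 16] = 255
        return frame

    return VideoClip(make_frame, duration=duration)

if __name__ == "__main__":
    clip = make_clip(seed=7, n_distractors=12, speed=60)
    clip.write_videofile("example_clip.mp4", fps=24)  # deterministic output given the seed
```

Re-running the same script with a larger `n_distractors` or a higher `speed` produces a strictly harder instance of the same task, which is the mechanism that lets such a benchmark keep scaling in difficulty as models improve.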

📝 Abstract
Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3, which represented the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
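
The paper's evaluation harness is released separately; as a rough sketch of what scoring such a benchmark involves, the loop below reads a manifest of (video, question, answer, category) records, queries a model, and reports per-category exact-match accuracy. The manifest layout, field names, and the `query_vlm` placeholder are assumptions for illustration, not the actual MORSE-500 interface.

```python
# Hedged sketch of a benchmark evaluation loop (assumed format, not the released harness).
import json
from collections import defaultdict

def query_vlm(video_path: str, question: str) -> str:
    """Placeholder for a call to a video-capable VLM (API client or local model)."""
    raise NotImplementedError

def evaluate(manifest_path: str) -> dict:
    # Assumed manifest: one JSON object per line with "video", "question",
    # "answer", and "category" fields.
    with open(manifest_path) as f:
        items = [json.loads(line) for line in f]

    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in items:
        pred = query_vlm(item["video"], item["question"])
        hit = pred.strip().lower() == item["answer"].strip().lower()  # exact-match scoring
        per_category[item["category"]][0] += int(hit)
        per_category[item["category"]][1] += 1

    # Per-category accuracy, e.g. {"abstraction": 0.34, "planning": 0.38, ...}
    return {cat: correct / total for cat, (correct, total) in per_category.items()}
```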
Problem

Research questions and friction points this paper is trying to address.

Current benchmarks rely on static images, missing the temporal complexity of real-world environments
Existing benchmarks focus narrowly on math, missing broader reasoning skills
Many benchmarks saturate quickly, limiting progress measurement and diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Programmatically generated video benchmark
Controllable visual and temporal complexity
Supports evolving difficulty levels