$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

📅 2025-12-21

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

Current large multimodal models lack the capability to consistently reason about dynamic object state changes across videos within a unified spatial context. Method: We introduce $M^3$-Verse—the first benchmark for multi-view video state change understanding—comprising 270 indoor scenes and 2,932 fine-grained questions. We formally define and quantify multi-state, multi-view, and multi-dimensional visual change understanding; propose a structured evaluation framework covering four core capabilities and 50+ subtasks; and design a lightweight baseline integrating dual-video alignment, cross-view spatiotemporal annotation, state-difference attention, and vision-language joint modeling. Contribution/Results: Evaluating 16 SOTA models reveals pervasive weaknesses in state-transition reasoning. Our approach achieves an overall accuracy improvement of 12.7%, with gains up to 23.4% on complex transformation subtasks, and supports interpretable analysis.

Technology Category

Application Category

📝 Abstract

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations, remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from https://github.com/Wal-K-aWay/M3-Verse_pipeline and full benchmark data from https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

Problem

Research questions and friction points this paper is trying to address.

Evaluates LMMs' ability to understand dynamic object changes between paired videos

Introduces a benchmark to test spatial-temporal reasoning in consistent environments

Addresses limitations in tracking state transitions for multi-modal perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces M3-Verse benchmark for multi-state video analysis

Proposes baseline method improving multi-state perception performance

Evaluates 16 LMMs revealing limitations in tracking state transitions

🔎 Similar Papers

No similar papers found.