MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of comprehensive evaluation frameworks for multimodal reasoning in real-world and domain-specific applications. Methodologically, it introduces the MARS2 2025 Multimodal Reasoning Challenge benchmark, which extends multimodal reasoning evaluation to specialized domains such as advertising video understanding. It releases two tailored datasets, Lens (covering 12 everyday scenarios) and AdsQA (focused on advertising video reasoning), and defines three competition tracks: real-world visual grounding, spatial-aware visual question answering, and advertising video reasoning. The benchmark spans visual grounding, spatial relation modeling, and cross-modal reasoning, and its code release incorporates 40+ baseline models and 15+ participant methods. The challenge attracted 76 international teams, yielding 40+ valid submissions out of 1200+ total. All data, code, and leaderboards are fully open-sourced, establishing a reproducible, general-purpose yet domain-inclusive evaluation framework for multimodal reasoning.

📝 Abstract
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark, and we hope it allows researchers to better follow the state of the art in this fast-moving area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Accordingly, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets, Lens and AdsQA, as test sets; they support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines, including both generalist MLLMs and task-specific models, and opened three competition tracks: Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). In total, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) were included in our ranking lists. Our datasets, code (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where updates and announcements of upcoming events will be posted continuously.
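As background for the VG-RS track: visual grounding predictions are typically bounding boxes, and such outputs are commonly scored by intersection-over-union (IoU) against ground truth at a fixed threshold. The sketch below is illustrative only, not the challenge's official scorer; the (x1, y1, x2, y2) box format, the 0.5 threshold, and the function names are assumptions.

```python
# Minimal sketch of IoU-based scoring for visual grounding (illustrative;
# NOT the official MARS2 evaluation code). Boxes are (x1, y1, x2, y2),
# an assumed format.

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth meets thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: one good localization, one complete miss -> accuracy 0.5.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 8, 52, 48), (30, 30, 60, 60)]
print(grounding_accuracy(preds, gts))  # 0.5
```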
Problem

Research questions and friction points this paper is trying to address.

Advancing multimodal reasoning through real-world and specialized scenarios
Evaluating models on tailored datasets for general and domain-specific tasks
Benchmarking 40+ baselines across three competition tracks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created Lens and AdsQA multimodal datasets
Evaluated 40+ baseline models across tracks
Opened three specialized competition tracks
Authors

Peng Xu - competition organizer
Shengwu Xiong (Wuhan University of Technology) - artificial intelligence
Jiajun Zhang (Institute of Automation, Chinese Academy of Sciences) - natural language processing, large language models, multimodal information processing
Yaxiong Chen (Wuhan University of Technology) - deep hashing, deep learning
Bowen Zhou - steering committee
Chen Change Loy (President's Chair Professor, MMLab@NTU, S-Lab, Nanyang Technological University) - computer vision, image processing, machine learning
David A. Clifton (Chair of Clinical Machine Learning, University of Oxford) - machine learning, clinical AI, biomedical signal processing
Kyoung Mu Lee (Professor, Department of Electrical and Computer Engineering, Seoul National University) - computer vision, machine learning, artificial intelligence
Luc Van Gool (Professor of Computer Vision, INSAIT, Sofia University; em. KU Leuven; em. ETHZ; Toyota Lab TRACE) - computer vision, machine learning, AI, autonomous cars, cultural heritage
Ruiming He - organizing contributor
Ruilin Yao - organizing contributor
Xinwei Long (Tsinghua University) - natural language processing, multi-modal learning
Jirui Huang - organizing contributor
Kai Tian - baseline implementor
Sa Yang - baseline implementor
Yihua Shao - baseline implementor
Jin Feng - organizing contributor
Yue Zhong - organizing contributor
Jiakai Zhou - organizing contributor
Cheng Tang - organizing contributor
Tianyu Zou - baseline implementor
Yifang Zhang - baseline implementor
Junming Liang - baseline implementor
Guoyou Li - baseline implementor
Zhaoxiang Wang - baseline implementor