NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities

📅 2025-09-20
🤖 AI Summary
Existing 3D benchmarks lack fine-grained numerical reasoning annotations, limiting multimodal large language models' (MLLMs) capabilities in spatial measurement and complex geometric computation. To address this, the authors propose NUMINA, the first 3D multimodal benchmark for indoor-scene understanding and numerical reasoning, covering precise spatial tasks such as distance estimation, volume calculation, and multi-step geometric reasoning. They design NUMINA-Flow, an automated annotation pipeline integrating LLM-based question rewriting and rule-guided self-verification, enabling scalable, high-precision, multi-scale QA generation. Evaluating state-of-the-art MLLMs on NUMINA under the existing Chat-Scene framework reveals significant deficiencies in 3D numerical reasoning, particularly in exact quantitative computation, highlighting critical gaps in current multimodal reasoning capabilities. This work establishes a new benchmark, annotation methodology, and evaluation baseline for 3D multimodal numerical reasoning.
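The summary's "rule-guided self-verification" step can be pictured with a minimal sketch: after an LLM rewrites a question-answer pair, a rule checks that the rewritten answer still carries the same numeric value as the original before the pair is accepted. The function names and the tolerance here are illustrative assumptions, not the actual NUMINA-Flow implementation.

```python
import re

def extract_number(text):
    """Pull the first numeric value from an answer string, or None if absent."""
    m = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def verify_rewrite(original_answer, rewritten_answer, rel_tol=0.01):
    """Accept a rewritten QA pair only if its numeric answer matches the original
    within a relative tolerance (a simple stand-in for a rule-based check)."""
    a = extract_number(original_answer)
    b = extract_number(rewritten_answer)
    if a is None or b is None:
        return False
    return abs(a - b) <= rel_tol * max(abs(a), 1e-9)

print(verify_rewrite("The distance is 2.5 meters.", "They are roughly 2.5 m apart."))  # True
print(verify_rewrite("The volume is 0.8 cubic meters.", "About 1.2 cubic meters."))    # False
```

A real pipeline would also check units and phrasing, but a numeric-consistency gate of this kind is enough to reject rewrites that silently change the answer.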

📝 Abstract
Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Moreover, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source codes can be obtained from https://github.com/fengshun124/NUMINA.
Problem

Research questions and friction points this paper is trying to address.

Extending 2D multimodal capabilities to complex 3D spatial reasoning environments
Addressing the lack of fine-grained numerical reasoning annotations in 3D benchmarks
Enhancing multimodal models' ability to perform precise spatial measurements and computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces NUMINA benchmark for 3D numerical reasoning
Uses automated annotation pipeline with LLM rewriting
Evaluates models on precise spatial computations such as distance and volume estimation
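The spatial computations named above have simple ground truths when objects are annotated with 3D bounding boxes. The following sketch, using hypothetical helper names and axis-aligned boxes in scene units (e.g. meters), shows the kind of distance and volume values such a benchmark can derive automatically; it is not the actual NUMINA annotation code.

```python
import math

def bbox_center(bbox):
    """Center of an axis-aligned 3D box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    xmin, ymin, zmin, xmax, ymax, zmax = bbox
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, (zmin + zmax) / 2)

def center_distance(bbox_a, bbox_b):
    """Euclidean distance between two box centers."""
    return math.dist(bbox_center(bbox_a), bbox_center(bbox_b))

def bbox_volume(bbox):
    """Volume of an axis-aligned 3D box."""
    xmin, ymin, zmin, xmax, ymax, zmax = bbox
    return (xmax - xmin) * (ymax - ymin) * (zmax - zmin)

# Example: a 1 m cube at the origin and an identical cube offset 3 m along x.
table = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
chair = (3.0, 0.0, 0.0, 4.0, 1.0, 1.0)
print(round(center_distance(table, chair), 2))  # 3.0
print(bbox_volume(table))                       # 1.0
```

Ground truths like these are exactly what the paper reports current MLLMs failing to reproduce from visual input alone.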
👥 Authors
Changyu Zeng, XJTLU (self-supervised learning, point cloud, computer vision)
Yifan Wang, Department of Computer Science, University of Liverpool, United Kingdom
Zimu Wang, Tsinghua University (recommendation)
Wei Wang, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Zhengni Yang, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Muyi Bao, Carnegie Mellon University
Jiming Xiao, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Anh Nguyen, Department of Computer Science, University of Liverpool, United Kingdom
Yutao Yue, The Hong Kong University of Science and Technology (Guangzhou), China