NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities

📅 2025-09-20
🤖 AI Summary
Existing 3D benchmarks lack fine-grained numerical reasoning annotations, limiting multimodal large language models' (MLLMs) capabilities in spatial measurement and complex geometric computation. To address this, the authors propose NUMINA, the first 3D multimodal benchmark for indoor-scene understanding and numerical reasoning, covering precise spatial tasks such as distance estimation, volume calculation, and multi-step geometric reasoning. They design NUMINA-Flow, an automated annotation pipeline integrating LLM-based question rewriting and rule-guided self-verification, enabling scalable, high-precision, multi-scale QA generation. Evaluating state-of-the-art MLLMs on NUMINA under the existing Chat-Scene framework reveals significant deficiencies in 3D numerical reasoning, particularly in exact quantitative computation, highlighting critical gaps in current multimodal reasoning capabilities. This work establishes a new benchmark, annotation methodology, and evaluation baseline for 3D multimodal numerical reasoning.
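The summary's "rule-guided self-verification" step can be pictured with a minimal sketch: after an LLM rewrites a question-answer pair, a rule checks that the rewritten answer still carries the same numeric value as the original before the pair is accepted. The function names and the tolerance here are illustrative assumptions, not the actual NUMINA-Flow implementation.

```python
import re

def extract_number(text):
    """Pull the first numeric value from an answer string, or None if absent."""
    m = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def verify_rewrite(original_answer, rewritten_answer, rel_tol=0.01):
    """Accept a rewritten QA pair only if its numeric answer matches the original
    within a relative tolerance (a simple stand-in for a rule-based check)."""
    a = extract_number(original_answer)
    b = extract_number(rewritten_answer)
    if a is None or b is None:
        return False
    return abs(a - b) <= rel_tol * max(abs(a), 1e-9)

print(verify_rewrite("The distance is 2.5 meters.", "They are roughly 2.5 m apart."))  # True
print(verify_rewrite("The volume is 0.8 cubic meters.", "About 1.2 cubic meters."))    # False
```

A real pipeline would also check units and phrasing, but a numeric-consistency gate of this kind is enough to reject rewrites that silently change the answer.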

📝 Abstract
Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Moreover, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source codes can be obtained from https://github.com/fengshun124/NUMINA.
Problem

Research questions and friction points this paper is trying to address.

Extending 2D multimodal capabilities to complex 3D spatial reasoning environments
Addressing the lack of fine-grained numerical reasoning annotations in 3D benchmarks
Enhancing multimodal models' ability to perform precise spatial measurements and computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces NUMINA benchmark for 3D numerical reasoning
Uses automated annotation pipeline with LLM rewriting
Evaluates models on precise spatial computations such as distance and volume estimation
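The spatial computations named above have simple ground truths when objects are annotated with 3D bounding boxes. The following sketch, using hypothetical helper names and axis-aligned boxes in scene units (e.g. meters), shows the kind of distance and volume values such a benchmark can derive automatically; it is not the actual NUMINA annotation code.

```python
import math

def bbox_center(bbox):
    """Center of an axis-aligned 3D box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    xmin, ymin, zmin, xmax, ymax, zmax = bbox
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, (zmin + zmax) / 2)

def center_distance(bbox_a, bbox_b):
    """Euclidean distance between two box centers."""
    return math.dist(bbox_center(bbox_a), bbox_center(bbox_b))

def bbox_volume(bbox):
    """Volume of an axis-aligned 3D box."""
    xmin, ymin, zmin, xmax, ymax, zmax = bbox
    return (xmax - xmin) * (ymax - ymin) * (zmax - zmin)

# Example: a 1 m cube at the origin and an identical cube offset 3 m along x.
table = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
chair = (3.0, 0.0, 0.0, 4.0, 1.0, 1.0)
print(round(center_distance(table, chair), 2))  # 3.0
print(bbox_volume(table))                       # 1.0
```

Ground truths like these are exactly what the paper reports current MLLMs failing to reproduce from visual input alone.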
👥 Authors
Changyu Zeng, XJTLU (self-supervised learning, point cloud, computer vision)
Yifan Wang, Department of Computer Science, University of Liverpool, United Kingdom
Zimu Wang, Tsinghua University (recommendation)
Wei Wang, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Zhengni Yang, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Muyi Bao, Carnegie Mellon University
Jiming Xiao, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China
Anh Nguyen, Department of Computer Science, University of Liverpool, United Kingdom
Yutao Yue, The Hong Kong University of Science and Technology (Guangzhou), China