MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

📅 2026-04-10

🤖 AI Summary

Existing vision-language models struggle with open-domain grounding and reasoning in 3D scenes: their zero-shot generalization is limited and they handle complex spatial relationships poorly. This work proposes a training-free multi-agent framework with three expert agents (planner, grounder, and coder) that collaboratively perform dynamic task decomposition, free-form object grounding with relevant frame retrieval, and geometric reasoning through executable programs. The framework uses off-the-shelf vision-language models without any fine-tuning or other model adaptation. The authors report state-of-the-art performance on challenging 3D understanding benchmarks in this zero-shot setting.
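
To make "geometric reasoning through executable programs" concrete, here is a minimal sketch of the kind of short program a coding agent might emit once objects have been grounded. It is not taken from the paper: the `Box3D` type, the helper names, and the box coordinates are all hypothetical, and axis-aligned boxes in metres are an assumed output format for the grounder.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Axis-aligned 3D box; a hypothetical grounder output format (metres)."""
    label: str
    min_xyz: tuple[float, float, float]
    max_xyz: tuple[float, float, float]

    @property
    def center(self) -> tuple[float, ...]:
        # Midpoint of the box along each axis.
        return tuple((lo + hi) / 2 for lo, hi in zip(self.min_xyz, self.max_xyz))


def center_distance(a: Box3D, b: Box3D) -> float:
    """Euclidean distance between the two box centers."""
    return math.dist(a.center, b.center)


def closest_to(target: Box3D, candidates: list[Box3D]) -> Box3D:
    """Return the candidate whose center is nearest to the target's center."""
    return min(candidates, key=lambda c: center_distance(target, c))


# Illustrative, made-up scene: "Which object is closest to the sofa?"
sofa = Box3D("sofa", (0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
lamp = Box3D("lamp", (3.0, 0.0, 0.0), (3.5, 0.5, 1.5))
table = Box3D("table", (1.0, 2.0, 0.0), (2.0, 3.0, 0.5))

nearest = closest_to(sofa, [lamp, table])
print(nearest.label, round(center_distance(sofa, nearest), 3))
```

Expressing the spatial query as code means the answer comes from explicit arithmetic on grounded coordinates, which can be rerun and checked, rather than from the model's free-form estimate.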

📝 Abstract

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
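
The division of labor described in the abstract can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the authors' implementation: every name here (`Step`, `Context`, `plan`, `ground`, `code`, `run`) is hypothetical, the agents are hard-coded stubs standing in for VLM calls, and the plan is produced once up front, whereas the paper describes dynamic coordination.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Step:
    agent: str        # which expert handles this step: "grounder" or "coder"
    instruction: str  # natural-language sub-task for that expert


@dataclass
class Context:
    query: str
    notes: dict[str, object] = field(default_factory=dict)  # shared scratchpad


def plan(query: str) -> list[Step]:
    """Stub planner; a real one would prompt a VLM to decompose the query."""
    return [
        Step("grounder", f"locate the objects mentioned in: {query}"),
        Step("coder", "compute the requested spatial relation"),
    ]


def ground(step: Step, ctx: Context) -> None:
    """Stub grounder; returns made-up object centers instead of real 3D grounding."""
    ctx.notes["objects"] = {"chair": (0.0, 0.0, 0.0), "door": (3.0, 4.0, 0.0)}


def code(step: Step, ctx: Context) -> None:
    """Stub coder; runs a fixed geometric computation on the grounded objects."""
    objects = ctx.notes["objects"]
    ctx.notes["answer"] = math.dist(objects["chair"], objects["door"])


AGENTS = {"grounder": ground, "coder": code}


def run(query: str) -> float:
    """Dispatch each planned step to its expert and return the final answer."""
    ctx = Context(query)
    for step in plan(query):
        AGENTS[step.agent](step, ctx)
    return ctx.notes["answer"]


print(run("How far is the chair from the door?"))
```

The shared `Context` is one simple way to pass grounding results to the coding step; a dynamic planner could inspect it after each step and revise the remaining plan.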

Problem

Research questions and friction points this paper is trying to address.

3D grounded reasoning

vision-language models

zero-shot generalization

spatial relationships

multi-agent reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent

3D grounded reasoning

training-free