SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle to adaptively integrate multimodal spatial cues for spatial reasoning, particularly under distribution shifts. This work proposes a heterogeneous multi-agent framework with a Test-Time Orchestration (TTO) mechanism that dynamically evaluates and reweights vision-language experts with complementary inductive biases, without updating model parameters, to enable context-aware spatial reasoning. By combining heterogeneous architectures, dynamic weight allocation, and explicit spatial relationship modeling, the method consistently improves over both open- and closed-source baselines on 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, indicating better spatial adaptability and generalization.

📝 Abstract

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
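
The abstract does not spell out how TTO scores or reweights agents, so the following is only an illustrative sketch of the general idea of reweighting frozen agents at inference time, not the authors' algorithm. It assumes a multiplicative-weights update in which agreement with the weighted consensus stands in for "observed reliability"; the agent names, the `eta` step size, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer that collects the largest total agent weight."""
    scores = defaultdict(float)
    for agent, answer in answers.items():
        scores[answer] += weights[agent]
    return max(scores, key=scores.get)


def reweight(answers, weights, consensus, eta=0.5):
    """Shrink the weight of agents that disagree with the consensus.

    Assumption: agreement with the consensus is used as a label-free
    proxy for reliability. The weights are renormalized to sum to 1.
    """
    updated = {
        agent: w * (1.0 if answers[agent] == consensus else math.exp(-eta))
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}


def orchestrate(stream, agents):
    """Answer each query by weighted vote, then update the agent weights.

    `agents` maps a name to a callable; the callables stand in for frozen
    vision-language specialists, whose parameters are never modified.
    """
    weights = {name: 1.0 / len(agents) for name in agents}
    predictions = []
    for query in stream:
        answers = {name: fn(query) for name, fn in agents.items()}
        consensus = weighted_vote(answers, weights)
        predictions.append(consensus)
        weights = reweight(answers, weights, consensus)
    return predictions, weights


# Toy usage: one agent with a fixed prior, two agents that track the input.
agents = {
    "appearance": lambda q: "left",
    "depth": lambda q: q["truth"],
    "geometry": lambda q: q["truth"],
}
stream = [{"truth": "left"}, {"truth": "right"}, {"truth": "right"}]
predictions, weights = orchestrate(stream, agents)
```

In this toy run the agent with the fixed prior loses weight once the inputs shift, while only the mixing weights change and the agents themselves stay untouched. A consensus-based proxy can reinforce a majority that is wrong, which is one reason a real system would likely need a stronger reliability signal than this sketch uses.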

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language agents

inductive biases

adaptability

heterogeneous multi-agent

Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous multi-agent

spatial reasoning

test-time orchestration