ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models often rely on single-view shortcuts in multi-view spatial reasoning: they fail to model cross-view relationships, and their performance degrades under viewpoint changes and occlusions. This work proposes ViewFusion, a two-stage framework. In the first stage, spatial pre-thinking explicitly models cross-view spatial relations and transformations to build a structured intermediate workspace. In the second stage, the model performs question-driven reasoning over this workspace to generate the answer. The work presents ViewFusion as the first approach to decouple multi-view spatial reasoning into explicit spatial alignment and question-answering stages, using a structured spatial chain-of-thought to avoid single-view shortcuts. The model is trained with synthetic reasoning supervision followed by GRPO reinforcement learning, which improves answer correctness and stabilizes the two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on samples that require genuine cross-view alignment.
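To make the two-stage structure concrete, the following is a minimal sketch of how a generated response could be checked for the intended layout: a spatial pre-thinking block first, then question-driven reasoning, then the answer. The tag names (`<spatial_thinking>`, `<reasoning>`, `<answer>`) and the helper `parse_two_stage` are hypothetical; the summary does not specify the actual output format ViewFusion uses.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical tag names: the source does not specify the output format.
_TWO_STAGE = re.compile(
    r"^\s*<spatial_thinking>(?P<workspace>.+?)</spatial_thinking>\s*"
    r"<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


class TwoStageOutput(NamedTuple):
    workspace: str  # stage 1: cross-view relations and transformations
    reasoning: str  # stage 2: question-driven reasoning over the workspace
    answer: str     # final prediction


def parse_two_stage(text: str) -> Optional[TwoStageOutput]:
    """Return the parsed stages, or None if the layout is not followed."""
    match = _TWO_STAGE.match(text)
    if match is None:
        return None
    return TwoStageOutput(
        match["workspace"].strip(),
        match["reasoning"].strip(),
        match["answer"].strip(),
    )
```

A check of this kind could serve as a format signal during training, since a response that skips the pre-thinking block returns `None`. Whether ViewFusion uses such a signal is not stated in the source.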

📝 Abstract

Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
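For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative advantage computation. The example rewards (1.0 for a correct answer, 0.0 otherwise), the use of the population standard deviation, and the `eps` value are assumptions for illustration; the abstract does not describe ViewFusion's reward design.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled responses: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no policy gradient, which is why a reward that only some samples earn is what drives learning.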

Problem

Research questions and friction points this paper is trying to address.

multi-view reasoning

spatial reasoning

vision-language models

viewpoint transformation

occlusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view reasoning

spatial pre-alignment

structured thinking chains