Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models exhibit significant limitations in visuospatial cognition—particularly in spatial layout understanding, relational reasoning, and dynamic scene inference—largely due to inadequate spatially grounded representation architectures and insufficient high-quality training data. To address this, we propose ViCA2, a novel model featuring a decoupled dual-encoder architecture: SigLIP for semantic encoding and Hiera for hierarchical spatial-structural modeling, augmented with a token-ratio adaptive control mechanism. We further introduce ViCA-322K, the first large-scale spatially grounded instruction-tuning dataset, comprising 322K question-answer pairs. With only 7B parameters, ViCA2 achieves a substantial leap in spatial reasoning capability, scoring 56.8 on VSI-Bench—outperforming LLaVA-NeXT-Video-72B (40.9) and Gemini-1.5 Pro (45.4). The model, training code, and ViCA-322K dataset are fully open-sourced.

📝 Abstract
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in multimodal language models
Addressing lack of specialized data for spatial understanding
Improving visuospatial cognition with compact model architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual vision encoder integrates SigLIP and Hiera
Token ratio control mechanism enhances efficiency
ViCA-322K dataset enables targeted instruction tuning
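To make the dual-encoder idea above concrete, here is a minimal, hypothetical sketch of fusing tokens from a semantic encoder and a spatial encoder under a fixed token budget controlled by a ratio. The encoder stand-ins, the average-pooling strategy, and all function names are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def encode_semantic(image: np.ndarray, n_tokens: int = 196, dim: int = 64) -> np.ndarray:
    """Stand-in for a SigLIP-style semantic encoder: one token per coarse patch."""
    rng = np.random.default_rng(0)  # placeholder features, not real embeddings
    return rng.standard_normal((n_tokens, dim))

def encode_spatial(image: np.ndarray, n_tokens: int = 784, dim: int = 64) -> np.ndarray:
    """Stand-in for a Hiera-style hierarchical encoder: a finer token grid."""
    rng = np.random.default_rng(1)
    return rng.standard_normal((n_tokens, dim))

def pool_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Average-pool a token sequence down to `keep` tokens (simple budget control)."""
    groups = np.array_split(tokens, keep, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

def fuse(image: np.ndarray, ratio: float = 0.5, budget: int = 256) -> np.ndarray:
    """Split a fixed visual-token budget between the two encoders by `ratio`,
    then concatenate the pooled streams for the language model."""
    n_sem = int(budget * ratio)
    n_spa = budget - n_sem
    sem = pool_tokens(encode_semantic(image), n_sem)
    spa = pool_tokens(encode_spatial(image), n_spa)
    return np.concatenate([sem, spa], axis=0)  # shape: (budget, dim)

img = np.zeros((224, 224, 3))
tokens = fuse(img, ratio=0.25, budget=256)
print(tokens.shape)  # (256, 64)
```

The point of the ratio is that the spatial stream produces many more tokens than the semantic one, so a controllable split lets a compact model spend its context budget where spatial detail matters.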
Qi Feng (Kyoto University)
Hidetoshi Shimodaira (Kyoto University; Statistics, Machine Learning)