🤖 AI Summary
Current multimodal large language models exhibit significant limitations in visuospatial cognition, particularly in spatial layout understanding, relational reasoning, and dynamic scene inference, largely due to the lack of spatially grounded representation architectures and of high-quality training data. To address this, we propose ViCA2, a novel model featuring a decoupled dual-encoder architecture: SigLIP for semantic encoding and Hiera for hierarchical spatial-structural modeling, augmented with a token ratio control mechanism for efficiency. We further introduce ViCA-322K, a new large-scale spatially grounded instruction-tuning dataset comprising over 322,000 question-answer pairs. With only 7B parameters, ViCA2 achieves a substantial leap in spatial reasoning capability, scoring 56.8 on VSI-Bench and outperforming LLaVA-NeXT-Video-72B (40.9) and Gemini-1.5 Pro (45.4). The model, training code, and ViCA-322K dataset are fully open-sourced.
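To make the dual-encoder design concrete, here is a minimal PyTorch sketch of how two vision streams could be merged under a token-ratio budget. All names and values (DualEncoderFusion, the hidden dimensions, the truncation-based token selection) are illustrative assumptions, not the actual ViCA2 implementation.

```python
# Minimal sketch of a dual-encoder fusion with a token-ratio budget.
# Module names, dimensions, and the naive truncation-based selection
# are assumptions for illustration, not the real ViCA2 code.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, sem_dim=1152, spa_dim=768, llm_dim=4096, token_ratio=0.5):
        super().__init__()
        # Project each encoder's tokens into the LLM embedding space.
        self.sem_proj = nn.Linear(sem_dim, llm_dim)
        self.spa_proj = nn.Linear(spa_dim, llm_dim)
        # Fraction of the visual token budget given to the spatial stream.
        self.token_ratio = token_ratio

    def forward(self, sem_tokens, spa_tokens, budget=576):
        # sem_tokens: (B, N_s, sem_dim) from a semantic encoder (e.g. SigLIP).
        # spa_tokens: (B, N_h, spa_dim) from a hierarchical encoder (e.g. Hiera).
        n_spa = int(budget * self.token_ratio)
        n_sem = budget - n_spa
        # Keep the first n tokens of each stream -- a stand-in for whatever
        # pooling/selection the real model uses.
        sem = self.sem_proj(sem_tokens[:, :n_sem])
        spa = self.spa_proj(spa_tokens[:, :n_spa])
        # Concatenate into one visual token sequence for the LLM.
        return torch.cat([sem, spa], dim=1)

fusion = DualEncoderFusion()
sem = torch.randn(1, 729, 1152)   # e.g. SigLIP patch tokens
spa = torch.randn(1, 1024, 768)   # e.g. Hiera multi-scale tokens
print(fusion(sem, spa).shape)     # torch.Size([1, 576, 4096])
```

Varying `token_ratio` trades semantic coverage against spatial-structural detail within a fixed visual token budget, which is the efficiency lever the summary refers to.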
📝 Abstract
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition (reasoning about spatial layouts, relations, and dynamics) remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
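For a sense of what a spatially grounded instruction-tuning pair might look like, below is one hypothetical record in the LLaVA-style conversation format commonly used for video instruction tuning. The field names and example content are assumptions for illustration; consult the released ViCA-322K dataset for the actual schema.

```python
# Hypothetical shape of one spatially grounded QA record.
# Keys ("video", "conversations", "from", "value") follow the common
# LLaVA-style convention; this is not the confirmed ViCA-322K schema.
example = {
    "video": "scenes/living_room_0042.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nHow many chairs are between the sofa and the window?",
        },
        {
            "from": "gpt",
            "value": "There are two chairs between the sofa and the window.",
        },
    ],
}
```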