GeoWorld-VLM: Geometry from World Models for Vision-Language Models

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the brittleness of current vision-language models (VLMs) on elementary spatial relations, which the authors trace in part to the loss of 3D geometric structure during visual feature extraction. To mitigate this, they propose GeoWorld-VLM, which uses a frozen camera-conditioned video world model as a geometric teacher and distills spatial knowledge from the synthetic multi-view signal it produces. Only the image encoder and multimodal projector are fine-tuned; the language model stays frozen to preserve its linguistic capabilities. Training combines three terms: spatial answer supervision, alignment of post-projector image features with intermediate world-model representations, and a preservation anchor to the original VLM. Applied to two distinct VLM architectures, GeoWorld-VLM improves performance by approximately 4% on both the What'sUp and VSR benchmarks, suggesting that the approach generalizes across backbones and spatial reasoning datasets.
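The split between trainable and frozen components can be illustrated with a minimal sketch. The parameter-name prefixes below (`image_encoder.`, `projector.`, `language_model.`, `world_model_teacher.`) are hypothetical and chosen only for illustration; real VLM checkpoints name their modules differently, and the summary does not specify any naming.

```python
from typing import Dict, Iterable

# Hypothetical name prefixes for the two components that are fine-tuned.
# Everything else (language model, world-model teacher) stays frozen.
TRAINABLE_PREFIXES = ("image_encoder.", "projector.")


def trainable_flags(param_names: Iterable[str]) -> Dict[str, bool]:
    """Map each parameter name to True if it should receive gradient updates."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return {name: name.startswith(TRAINABLE_PREFIXES) for name in param_names}


example_flags = trainable_flags(
    [
        "image_encoder.blocks.0.weight",
        "projector.linear.bias",
        "language_model.layers.0.weight",
        "world_model_teacher.unet.weight",
    ]
)
```

In a deep-learning framework, flags like these would typically decide which parameters are handed to the optimizer; how the authors implement this is not stated here.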

📝 Abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as "left of", "on", "behind", and "between". One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the language model frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Because the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities, and any spatial improvement can be attributed to the enhanced visual pathway. To evaluate the effectiveness and generality of the method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
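The three-term training objective can be sketched as a weighted sum. The abstract does not give the loss forms or weights, so everything specific below is an assumption made for illustration: cross-entropy for answer supervision, cosine distance for teacher-student alignment, mean squared error for the preservation anchor, the weights `w_align` and `w_anchor`, and the premise that student and teacher features have already been mapped to the same dimension.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cross_entropy(logits: Vector, target: int) -> float:
    """Negative log-likelihood of the target answer under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def mean_squared_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def geoworld_style_loss(
    answer_logits: Vector,
    answer_index: int,
    student_tokens: Sequence[Vector],   # post-projector image features (trainable path)
    teacher_tokens: Sequence[Vector],   # frozen world-model features, same shape (assumed)
    original_tokens: Sequence[Vector],  # features from the original, unmodified VLM
    w_align: float = 1.0,               # assumed weight, not from the paper
    w_anchor: float = 0.1,              # assumed weight, not from the paper
) -> Dict[str, float]:
    n = len(student_tokens)
    answer = cross_entropy(answer_logits, answer_index)
    align = sum(cosine_distance(s, t) for s, t in zip(student_tokens, teacher_tokens)) / n
    anchor = sum(mean_squared_error(s, o) for s, o in zip(student_tokens, original_tokens)) / n
    total = answer + w_align * align + w_anchor * anchor
    return {"answer": answer, "align": align, "anchor": anchor, "total": total}
```

The anchor term is what keeps the fine-tuned visual features from drifting far from what the frozen language model was trained to read; raising `w_anchor` in this sketch trades spatial alignment for closeness to the original features.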

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Spatial Reasoning

3D Geometry

Visual Representation

World Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Spatial Reasoning

World Models