GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

Existing geospatial foundation models struggle to effectively integrate remote sensing imagery with structured socioeconomic tabular data, limiting their capacity for holistic modeling of environment–society–health interdependencies. This work proposes GeoViSTA, a novel vision-tabular Transformer architecture that, for the first time, enables alignment and joint modeling of co-registered remote sensing images and irregular census tract tabular data. Its core innovations include a geography-aware bidirectional cross-attention mechanism and a cross-modal masked autoencoding pretraining strategy, which together learn unified, comprehensive multimodal geospatial embeddings. Evaluated on downstream tasks such as disease mortality and wildfire frequency prediction, GeoViSTA substantially outperforms existing baselines, demonstrating strong transferability and generalization capabilities.

📝 Abstract

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

Problem

Research questions and friction points this paper is trying to address.

geospatial foundation models

multimodal representation

vision-tabular fusion

socioeconomic covariates

environmental representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-tabular fusion

geospatial foundation model

cross-modal attention