GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

📅 2026-05-14
📈 Citations: 0
Influential: 0
📄 PDF

career value

159K/year
🤖 AI Summary
Existing geospatial foundation models struggle to effectively integrate remote sensing imagery with structured socioeconomic tabular data, limiting their capacity for holistic modeling of environment–society–health interdependencies. This work proposes GeoViSTA, a novel vision-tabular Transformer architecture that, for the first time, enables alignment and joint modeling of co-registered remote sensing images and irregular census tract tabular data. Its core innovations include a geography-aware bidirectional cross-attention mechanism and a cross-modal masked autoencoding pretraining strategy, which together learn unified, comprehensive multimodal geospatial embeddings. Evaluated on downstream tasks such as disease mortality and wildfire frequency prediction, GeoViSTA substantially outperforms existing baselines, demonstrating strong transferability and generalization capabilities.
📝 Abstract
Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
Problem

Research questions and friction points this paper is trying to address.

geospatial foundation models
multimodal representation
vision-tabular fusion
socioeconomic covariates
environmental representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-tabular fusion
geospatial foundation model
cross-modal attention
masked autoencoding
multimodal representation learning