YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

📅 2026-04-01
🤖 AI Summary
This study addresses the limitations of existing crop yield prediction datasets, which are scarce, low in quality, or restricted to single regions or crop types, and which therefore hinder scalable data-driven modeling. The authors release YieldSAT, a large, high-resolution (10 m) public benchmark dataset spanning multiple countries (including Argentina, Brazil, Uruguay, and Germany), major crop types (including corn, rapeseed, soybeans, and wheat), and various climate zones. The dataset comprises over 12.2 million pixel-level yield samples from 2,173 expert-curated fields and 113,555 labeled multispectral satellite images, complemented by auxiliary environmental data. Using this resource, the work frames yield prediction as a pixel-level regression task and compares several deep learning models and data fusion architectures. It also identifies severe distribution shifts in the ground truth data as an open challenge and explores a domain-informed Deep Ensemble approach to mitigate them, reporting significant performance gains.
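To make the two ideas in the summary concrete, here is a minimal sketch of pixel-level regression scoring combined with plain Deep Ensemble averaging. It is not the authors' implementation: the function names `masked_mse` and `ensemble_predict`, the field mask, the toy yield values, and the choice of mean squared error are all assumptions made for illustration, and the "domain-informed" part of the paper's approach is not described in this listing, so the sketch does not attempt it.

```python
from statistics import mean, pvariance


def masked_mse(pred, target, mask):
    """Mean squared error over the pixels where mask is True.

    Assumption: only pixels inside the field boundary carry a yield label,
    so unlabeled pixels are excluded from the error.
    """
    errors = [
        (p - t) ** 2
        for p_row, t_row, m_row in zip(pred, target, mask)
        for p, t, m in zip(p_row, t_row, m_row)
        if m
    ]
    if not errors:
        raise ValueError("mask selects no pixels")
    return sum(errors) / len(errors)


def ensemble_predict(member_maps):
    """Per-pixel mean and population variance across ensemble members."""
    rows, cols = len(member_maps[0]), len(member_maps[0][0])
    mean_map = [
        [mean(m[r][c] for m in member_maps) for c in range(cols)]
        for r in range(rows)
    ]
    var_map = [
        [pvariance([m[r][c] for m in member_maps]) for c in range(cols)]
        for r in range(rows)
    ]
    return mean_map, var_map


# Hypothetical 2x2 yield maps (t/ha) from two independently trained models.
members = [
    [[4.0, 6.0], [5.0, 0.0]],
    [[6.0, 6.0], [7.0, 0.0]],
]
target = [[5.0, 7.0], [6.0, 0.0]]
mask = [[True, True], [True, False]]  # bottom-right pixel lies outside the field

mean_map, var_map = ensemble_predict(members)
member_errors = [masked_mse(m, target, mask) for m in members]
ensemble_error = masked_mse(mean_map, target, mask)
```

In this toy case the averaged map has a lower masked error than either member, and the per-pixel variance gives a rough disagreement signal. Whether the same holds on real yield data depends on how the members are trained.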
📝 Abstract
Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and covers major crop types such as corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate these shifts, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.
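The figures in the abstract can be turned into a rough sense of scale. The sketch below assumes that each yield sample corresponds to exactly one 10 m × 10 m pixel and treats "over 12.2 million" as a lower bound; the function name `dataset_scale` and the per-field averages are illustrative and are not reported by the authors.

```python
PIXEL_SIZE_M = 10          # spatial resolution stated in the abstract
NUM_SAMPLES = 12_200_000   # "over 12.2 million", used here as a lower bound
NUM_FIELDS = 2_173         # expert-curated fields
NUM_IMAGES = 113_555       # labeled satellite images


def dataset_scale(samples, fields, images, pixel_size_m):
    """Back-of-the-envelope scale estimates.

    Assumption: one yield sample covers one pixel of pixel_size_m x pixel_size_m.
    """
    total_area_ha = samples * pixel_size_m**2 / 10_000  # 1 ha = 10,000 m^2
    return {
        "total_area_ha": total_area_ha,
        "samples_per_field": samples / fields,
        "area_per_field_ha": total_area_ha / fields,
        "images_per_field": images / fields,
    }


scale = dataset_scale(NUM_SAMPLES, NUM_FIELDS, NUM_IMAGES, PIXEL_SIZE_M)
for name, value in scale.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the labeled area is at least about 122,000 ha, with roughly 5,600 samples (about 56 ha) and about 52 satellite images per field on average. If the same field is sampled in several seasons, the unique area covered would be smaller.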
Problem

Research questions and friction points this paper is trying to address.

crop yield prediction
dataset scarcity
data quality
multimodal data
scalable modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal dataset
high-resolution yield prediction
pixel regression
domain-informed Deep Ensemble
distribution shift