Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction

📅 2025-08-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses sub-field-scale crop yield prediction and the challenge of interpreting models that fuse multimodal data: multispectral satellite imagery, meteorological time series, terrain elevation, and soil properties. We propose Weighted Modality Activation (WMA) to quantify the contribution of each modality, and we estimate temporal feature attributions from the Transformer's self-attention using Attention Rollout (AR) and Generic Attention (GA), benchmarking both against Shapley Value Sampling (SVS). Crop phenology information is used to interpret the explanations in light of established agronomic knowledge. Transformer-based models outperform convolutional and recurrent baselines, with R² higher by 0.10 at the sub-field level and 0.04 at the field level. AR yields more robust and reliable temporal attributions than GA and SVS, as confirmed by qualitative and quantitative evaluation. The core contribution is pairing the Transformer's intrinsic attention mechanism with attribution at two granularities, the modality level and the time-step level, for multimodal crop yield prediction.

📝 Abstract

Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of the data modalities involved may necessitate the design of complex architectures, model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps, and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely Attention Rollout (AR) and Generic Attention (GA), and evaluate their performance against a Shapley-based model-agnostic estimation, Shapley Value Sampling (SVS). Additionally, we propose the Weighted Modality Activation (WMA) method to assess modality attributions and compare it with SVS attributions. Our findings indicate that Transformer-based models outperform other architectures, specifically convolutional and recurrent networks, achieving R² scores that are higher by 0.10 and 0.04 at the subfield and field levels, respectively. AR is shown to provide more robust and reliable temporal attributions than GA and SVS, as confirmed through qualitative and quantitative evaluation. Information about crop phenology stages was leveraged to interpret the explanation results in the light of established agronomic knowledge. Furthermore, modality attributions revealed varying patterns across the two methods compared.[...]
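To make the Attention Rollout idea mentioned in the abstract more concrete, here is a minimal NumPy sketch of the commonly used formulation: average each layer's attention over heads, mix in the identity to account for residual connections, renormalize rows, and multiply the layers together. The paper's exact implementation is not shown here, so the random attention maps, the tensor shapes, and the final averaging over query positions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance map.

    attentions: list of arrays, one per layer, each of shape
                (heads, tokens, tokens) with rows summing to 1.
    Returns an array of shape (tokens, tokens) whose rows sum to 1.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)            # average over heads
        a = 0.5 * a + 0.5 * identity           # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = a @ rollout                  # propagate through this layer
    return rollout

# Hypothetical example: 3 layers, 4 heads, 6 time steps of random attention.
rng = np.random.default_rng(0)
layers = [softmax(rng.normal(size=(4, 6, 6))) for _ in range(3)]
rollout = attention_rollout(layers)

# One possible per-time-step score (an assumption, not the paper's choice):
# average the relevance each input time step receives across all queries.
time_step_scores = rollout.mean(axis=0)
print(time_step_scores)
```

Because each layer matrix is row-normalized, the rolled-out map stays row-normalized as well, so the scores can be read as a distribution over input time steps.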

Problem

Research questions and friction points this paper is trying to address.

Explain multimodal learning for crop yield prediction

Assess feature and modality attribution methods

Compare Transformer models with other architectures
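As a point of reference for the attribution methods being assessed, the following is a minimal sketch of Shapley Value Sampling in its usual Monte Carlo form: features are switched from a baseline to their actual values in a random order, and each feature is credited with the change in model output it causes. The function name, the toy linear model, and the zero baseline are hypothetical choices for illustration, not the setup used in the paper.

```python
import numpy as np

def shapley_value_sampling(model_fn, x, baseline, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings.

    model_fn: callable mapping a 1-D feature vector to a scalar.
    x:        1-D array, the input to explain.
    baseline: 1-D array of the same shape, the "feature absent" reference.
    """
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    attributions = np.zeros(n_features)
    for _ in range(n_permutations):
        order = rng.permutation(n_features)
        current = baseline.astype(float)       # fresh copy, all features "off"
        prev_out = model_fn(current)
        for i in order:
            current[i] = x[i]                  # switch feature i "on"
            out = model_fn(current)
            attributions[i] += out - prev_out  # marginal contribution of i
            prev_out = out
    return attributions / n_permutations

# Toy linear model (an assumption for illustration only).
weights = np.array([2.0, -1.0, 0.5])
toy_model = lambda z: float(weights @ z)

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
svs = shapley_value_sampling(toy_model, x, baseline)
print(svs)
```

For a linear model the marginal contribution of each feature does not depend on the ordering, so the estimate matches `weights * (x - baseline)`; for nonlinear models the result is a sampling approximation whose variance shrinks as `n_permutations` grows.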

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based models for crop yield prediction

Attention Rollout for robust temporal attributions

Weighted Modality Activation for modality attributions
