Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction

📅 2025-08-09
🤖 AI Summary
This study addresses subfield-scale crop yield prediction and the challenge of model interpretability in multimodal data fusion, combining multispectral satellite imagery, meteorological time series, terrain elevation, and soil properties. It proposes Weighted Modality Activation (WMA) to assess the contribution of each modality and uses Attention Rollout (AR) for temporal feature attribution. Crop phenology information is used to interpret the explanations in light of established agronomic knowledge. Transformer-based models outperform convolutional and recurrent baselines, with R² scores higher by 0.10 at the subfield level and 0.04 at the field level. AR provides more robust and reliable temporal attributions than Generic Attention (GA) and Shapley Value Sampling (SVS). The core contribution is pairing the Transformer's intrinsic attention mechanism with purpose-built attribution methods, which yields attributions at both the modality level and the time-step level for multimodal crop yield prediction.

📝 Abstract
Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of the data modalities involved may necessitate the design of complex architectures, model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite time series, weather time series, terrain elevation maps, and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely Attention Rollout (AR) and Generic Attention (GA), and evaluate their performance against a Shapley-based model-agnostic estimation, Shapley Value Sampling (SVS). Additionally, we propose the Weighted Modality Activation (WMA) method to assess modality attributions and compare it with SVS attributions. Our findings indicate that Transformer-based models outperform other architectures, specifically convolutional and recurrent networks, achieving R² scores that are higher by 0.10 and 0.04 at the subfield and field levels, respectively. AR is shown to provide more robust and reliable temporal attributions than GA and SVS, as confirmed through qualitative and quantitative evaluation. Information about crop phenology stages was leveraged to interpret the explanation results in the light of established agronomic knowledge. Furthermore, modality attributions revealed varying patterns across the two methods compared.[...]
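
To make the attention-based attribution mentioned in the abstract more concrete, the following is a minimal sketch of attention rollout in its commonly described form: head-averaged attention maps are mixed with the identity matrix to account for residual connections, renormalized, and multiplied across layers. It is not the authors' implementation. The function name `attention_rollout`, the input layout (one array of shape heads × tokens × tokens per layer), and the equal 0.5/0.5 weighting of attention and identity are assumptions made for illustration.

```python
import numpy as np


def attention_rollout(attentions):
    """Roll attention out across layers (illustrative sketch).

    attentions: list of arrays, one per layer, each shaped
    (heads, tokens, tokens) with rows summing to 1.
    Returns a (tokens, tokens) matrix whose entry [i, j] estimates how
    much input token j contributes to output token i.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Average over attention heads (assumed aggregation choice).
        mean_attention = layer_attention.mean(axis=0)
        # Mix with the identity to model the residual connection.
        mixed = 0.5 * mean_attention + 0.5 * identity
        # Renormalize so every row remains a distribution.
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        # Compose with the rollout of all earlier layers.
        rollout = mixed @ rollout
    return rollout
```

For a time series encoder, each token would correspond to one time step, so a row of the returned matrix can be read as a temporal attribution profile over the input sequence.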

Problem

Research questions and friction points this paper is trying to address.

Explain multimodal learning for crop yield prediction
Assess feature and modality attribution methods
Compare Transformer models with other architectures
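
As a point of reference for the model-agnostic baseline assessed above, here is a minimal sketch of Shapley value sampling using Monte Carlo permutations: features are switched from a baseline value to their actual value in random order, and each feature's marginal change in the model output is averaged over permutations. This is a generic illustration, not the paper's setup. The names `shapley_value_sampling` and `model_fn`, the flat feature vector, and the zero baseline used in the usage note are all assumptions.

```python
import numpy as np


def shapley_value_sampling(model_fn, x, baseline, n_permutations=200, seed=0):
    """Estimate Shapley values by sampling feature permutations (sketch).

    model_fn: callable mapping a 1-D feature vector to a scalar output.
    x: 1-D array holding the input to explain.
    baseline: 1-D array of reference values standing in for "absent" features.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    attributions = np.zeros(x.shape[0])
    for _ in range(n_permutations):
        order = rng.permutation(x.shape[0])
        current = baseline.copy()
        previous_output = model_fn(current)
        for feature in order:
            # Reveal one more feature and record its marginal effect.
            current[feature] = x[feature]
            output = model_fn(current)
            attributions[feature] += output - previous_output
            previous_output = output
    return attributions / n_permutations
```

Within each permutation the marginal effects telescope, so the attributions always sum to `model_fn(x) - model_fn(baseline)`. For a linear model such as `lambda z: z @ w` with a zero baseline, the estimate reduces to `w * x` regardless of the sampled permutations.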

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based models for crop yield prediction
Attention Rollout for robust temporal attributions
Weighted Modality Activation for modality attributions
Hiba Najjar
PhD Candidate at RPTU, Research assistant at DFKI
AI, Explainable AI, Remote Sensing, Agriculture
Deepak Pathak
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
Marlon Nuske
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany; Bundesanstalt für Landwirtschaft und Ernährung, Bonn, Germany
Andreas Dengel
Professor of Computer Science, University of Kaiserslautern & Executive Director, DFKI
Artificial Intelligence, Machine Learning, Document Analysis, Semantic Technologies