🤖 AI Summary
Problem: Near-infrared (NIR) soil spectral libraries are typically small and generalize poorly, whereas mid-infrared (MIR) libraries are large but cannot be used directly for low-cost NIR-based soil property prediction. Method: We propose a self-supervised multi-fidelity learning framework based on a variational autoencoder (VAE) that embeds spectra in a compressed latent space. The VAE is pretrained on the large unlabeled MIR spectral library. The trained MIR decoder is then frozen and attached to a NIR encoder, which is trained on a smaller set of paired NIR–MIR samples to learn the NIR-to-MIR mapping, transferring high-fidelity MIR knowledge to NIR prediction. Finally, regression models link original spectra, predicted spectra, and latent embeddings to nine soil properties. Contribution/Results: The framework and its embeddings match or exceed baseline accuracy on all nine property predictions. Predictions from NIR-converted spectra fall short of those from original MIR spectra but are similar or superior to NIR-only models, which helps alleviate the data scarcity inherent to NIR libraries.
📝 Abstract
We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained on the large MIR spectral library using a variational autoencoder (VAE) to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also froze the trained MIR decoder and reused it for a spectrum conversion task: plugged into a NIR encoder, it lets the model learn the mapping from NIR to MIR spectra, so that the predictive capabilities contained in the large MIR library can be leveraged with a low-cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map original spectra, predicted spectra, and latent space embeddings to nine soil properties. Model performance was evaluated independently of the KSSL training data using a gold-standard test set and regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil property prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to the predictive performance of NIR-only models, suggesting that the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.
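The spectrum conversion step described above (a frozen MIR decoder driven by a trainable NIR encoder) can be illustrated with a toy sketch. This is not the authors' implementation: it replaces the VAE with deterministic linear maps and a one-dimensional latent, uses synthetic data, and assumes the decoder weights stand in for an already pretrained MIR decoder. All names (`decode`, `encode`, `conversion_loss`, `train_nir_encoder`, `w_dec`, `w_enc`, `mixing`) are hypothetical.

```python
import random


def decode(z, w_dec):
    """Frozen 'MIR decoder': maps a scalar latent z to a MIR-like spectrum."""
    return [z * w for w in w_dec]


def encode(nir, w_enc):
    """Trainable 'NIR encoder': maps a NIR-like spectrum to a scalar latent."""
    return sum(x * w for x, w in zip(nir, w_enc))


def conversion_loss(pairs, w_enc, w_dec):
    """Mean squared error between decoded NIR latents and paired MIR spectra."""
    total = 0.0
    for nir, mir in pairs:
        pred = decode(encode(nir, w_enc), w_dec)
        total += sum((p - t) ** 2 for p, t in zip(pred, mir)) / len(mir)
    return total / len(pairs)


def train_nir_encoder(pairs, w_dec, lr=0.1, epochs=300):
    """Gradient descent on the encoder weights only; w_dec is never updated."""
    w_enc = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w_enc)
        for nir, mir in pairs:
            pred = decode(encode(nir, w_enc), w_dec)
            # Chain rule through the frozen decoder: dLoss/dz for this pair.
            dz = 2.0 * sum((p - t) * w for p, t, w in zip(pred, mir, w_dec)) / len(mir)
            for i, x in enumerate(nir):
                grad[i] += dz * x / len(pairs)
        w_enc = [w - lr * g for w, g in zip(w_enc, grad)]
    return w_enc


# Synthetic paired data (assumption): one hidden latent drives both spectra.
rng = random.Random(0)
w_dec = [1.0, -0.5, 2.0, 0.3]   # stands in for a pretrained, frozen MIR decoder
mixing = [0.8, -0.4, 0.6]       # hypothetical latent-to-NIR response
pairs = []
for _ in range(50):
    z = rng.uniform(-1.0, 1.0)
    nir = [z * a + rng.gauss(0.0, 0.01) for a in mixing]
    pairs.append((nir, decode(z, w_dec)))

loss_before = conversion_loss(pairs, [0.0] * len(mixing), w_dec)
w_enc = train_nir_encoder(pairs, w_dec)
loss_after = conversion_loss(pairs, w_enc, w_dec)
print("conversion MSE before/after training:", loss_before, loss_after)
```

Only the encoder weights receive gradient updates, so the NIR encoder is forced to land in the latent space that the frozen decoder already understands. In a deep learning framework the same effect would typically be obtained by disabling gradients on the decoder parameters.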