Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address depth ambiguity, projection distortion, and occlusion under field conditions when estimating the volume of wheat spikes from multi-view RGB images, this paper proposes a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional LSTM. DINOv2 extracts visual features, the LSTM aggregates them across the image sequence, and deep supervision encourages more robust intermediate representations. Structured-light 3D scans serve as ground truth. On six-view indoor images, the deep supervised model achieves a MAPE of 6.46%, outperforming the 2D area-based (9.36%) and geometric reconstruction (13.98%) baselines. Fine-tuning on field-based single-image data yields a MAPE of 10.82%. The authors also report that object shape strongly affects prediction accuracy, with irregular geometries such as wheat spikes being harder for geometric methods than for the deep learning approach.

📝 Abstract
Estimating three-dimensional morphological traits from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes, using RGB image sequences and structured-light 3D scans as ground truth references. Due to the complex geometry of the spikes, we propose a neural network approach for volume estimation in 2D images, employing a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional Long Short-Term Memory (LSTM) network. By using deep supervision, the model is able to learn more robust intermediate representations, which enhances its generalisation ability across varying evaluation sequences. We benchmark our model against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Our deep supervised model achieves a mean absolute percentage error (MAPE) of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning the model on field-based single-image data enables domain adaptation, yielding a MAPE of 10.82%. We demonstrate that object shape significantly impacts volume prediction accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods compared to our deep learning approach.
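The reported errors are mean absolute percentage errors (MAPE) between predicted volumes and the structured-light reference volumes. As a minimal sketch of that metric, assuming plain Python lists of positive volumes (the function name `mape` and the example numbers are illustrative and not taken from the paper):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    if any(t == 0 for t in y_true):
        raise ValueError("reference values must be non-zero")
    # Average the relative deviation of each prediction from its reference.
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical reference and predicted volumes (arbitrary units).
volumes_scan = [1000.0, 2000.0, 4000.0]
volumes_pred = [1100.0, 1900.0, 4000.0]
print(f"MAPE: {mape(volumes_scan, volumes_pred):.2f}%")
```

Because each error is divided by its own reference volume, small and large spikes contribute equally to the score, which makes the metric comparable across objects of different size.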
Problem

Research questions and friction points this paper is trying to address.

Estimating 3D wheat spike morphology from 2D RGB images
Overcoming depth loss and occlusion in field conditions
Improving volume prediction accuracy for complex geometries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep supervised LSTM for 3D estimation
Transfer learning with DINOv2 and LSTM
Multi-view RGB images for volume prediction
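The idea behind deep supervision for a sequence model can be illustrated with a small loss sketch. It assumes that the network emits one volume estimate after each view in the sequence and that every intermediate estimate is penalised against the same reference volume. This page does not specify the paper's exact loss, so the function names, the uniform weighting, and the numbers below are hypothetical.

```python
def relative_error(target, pred):
    """Absolute error relative to the reference value."""
    return abs(target - pred) / abs(target)


def final_step_loss(step_preds, target):
    """Conventional supervision: only the last prediction is penalised."""
    return relative_error(target, step_preds[-1])


def deep_supervised_loss(step_preds, target, weights=None):
    """Deep supervision (assumed form): weighted mean of per-step errors."""
    if not step_preds:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(step_preds)  # uniform weighting is an assumption
    if len(weights) != len(step_preds):
        raise ValueError("weights and step_preds must have the same length")
    weighted = sum(w * relative_error(target, p)
                   for w, p in zip(weights, step_preds))
    return weighted / sum(weights)


# Hypothetical per-view predictions for one spike over a six-view sequence.
step_preds = [1500.0, 1300.0, 1200.0, 1100.0, 1050.0, 1000.0]
target = 1000.0
print(final_step_loss(step_preds, target))
print(deep_supervised_loss(step_preds, target))
```

In this example the final-step loss is zero while the deep supervised loss is not, so the early, less informed estimates still produce a training signal. That is one plausible way intermediate representations become more robust to shorter or reordered sequences.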
Olivia Zumsteg
ETH Zurich, Institute of Agricultural Science, Universitätsstrasse 2, Zurich, 8092, Switzerland
Nico Graf
ETH Zurich, Department of Computer Science, Universitätsstrasse 6, Zurich, 8092, Switzerland
Aaron Haeusler
ETH Zurich, Department of Mechanical and Process Engineering, Leonhardstrasse 21, Zurich, 8092, Switzerland
Norbert Kirchgessner
ETH Zurich, Institute of Agricultural Science, Universitätsstrasse 2, Zurich, 8092, Switzerland
Nicola Storni
ETH Zurich, Institute of Agricultural Science, Universitätsstrasse 2, Zurich, 8092, Switzerland
Lukas Roth
ETH Zürich
Andreas Hund
Group of Crop Science, ETH Zurich
physiological breeding, crop phenotyping, wheat