3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

📅 2026-05-20

🤖 AI Summary
Accurate estimation of wheat spike volume in field conditions is hindered by active 3D sensing that is sensitive to plant motion or poorly suited to outdoor environments, while purely 2D models lack explicit geometric information. To address these limitations, this work combines a pose-robust 3D point cloud network with a multi-view image Transformer in an ensemble, then transfers the ensemble's geometric awareness to a purely image-based student model through feature-based or label-based knowledge distillation. The two distilled models reduce the mean absolute error from 654.31 mm³ to 639.93 mm³ and 644.62 mm³ and raise the correlation coefficient from 0.76 to 0.77 and 0.82, respectively. Inference time per spike drops from 160 ms to 1.4 ms, and volume-dependent bias is mitigated.
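To make the two distillation variants concrete, here is a minimal sketch of what label-based and feature-based distillation objectives can look like for a scalar regression target such as spike volume. It is not the paper's loss: the function names, the L1 and mean-squared-error terms, and the weights `alpha` and `beta` are all assumptions chosen for illustration.

```python
from typing import Sequence


def label_distillation_loss(student_pred: float, teacher_pred: float,
                            target: float, alpha: float = 0.5) -> float:
    """Hypothetical label-based objective: blend the error against the
    measured volume with the error against the teacher's prediction."""
    task_loss = abs(student_pred - target)            # L1 error vs. ground truth
    distill_loss = abs(student_pred - teacher_pred)   # L1 error vs. teacher output
    return (1.0 - alpha) * task_loss + alpha * distill_loss


def feature_distillation_loss(student_feat: Sequence[float],
                              teacher_feat: Sequence[float],
                              student_pred: float, target: float,
                              beta: float = 1.0) -> float:
    """Hypothetical feature-based objective: task error plus a penalty that
    pulls the student's latent vector toward the teacher's latent vector."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("student and teacher features must have equal length")
    # Mean squared difference between the two latent vectors.
    feat_mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return abs(student_pred - target) + beta * feat_mse


if __name__ == "__main__":
    # Toy volumes in mm^3; the numbers are made up for the example.
    print(label_distillation_loss(900.0, 1000.0, 1100.0, alpha=0.5))
    print(feature_distillation_loss([1.0, 2.0], [1.0, 4.0], 900.0, 1100.0))
```

In both variants the teacher is only needed while training, which is consistent with the image-only inference described in the summary.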
📝 Abstract
Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach that applies knowledge distillation during training while keeping inference efficient and image-only. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
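The abstract does not spell out how the distance-based histogram features are built. One simple construction that is invariant to rigid motion is a normalized histogram of pairwise point distances, because rotations and translations leave those distances unchanged. The sketch below assumes that construction; `distance_histogram`, `rigid_transform`, the bin count, and the distance range are hypothetical choices, not taken from the paper.

```python
import math
from itertools import combinations
from typing import List, Sequence


def distance_histogram(points: Sequence[Sequence[float]],
                       n_bins: int = 5, max_dist: float = 4.0) -> List[float]:
    """Normalized histogram of all pairwise Euclidean distances.

    Assumed feature for illustration; it loops over all pairs, so the cost
    grows quadratically with the number of points.
    """
    width = max_dist / n_bins
    counts = [0] * n_bins
    n_pairs = 0
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        # Distances beyond max_dist are clipped into the last bin.
        idx = min(int(d / width), n_bins - 1)
        counts[idx] += 1
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two points")
    return [c / n_pairs for c in counts]


def rigid_transform(points: Sequence[Sequence[float]], angle: float,
                    shift: Sequence[float]) -> List[tuple]:
    """Rotate about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
    moved = rigid_transform(cloud, angle=0.7, shift=(5.0, -2.0, 1.0))
    # The two histograms agree because the pose change preserves distances.
    print(distance_histogram(cloud))
    print(distance_histogram(moved))
```

A feature of this kind discards the absolute pose of the spike, which is one plausible way to obtain the pose-robust geometric representation the abstract mentions.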
Problem

Research questions and friction points this paper is trying to address.

spike volume estimation
3D reconstruction
knowledge distillation
multi-view image models
field phenotyping
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation
3D reconstruction
multi-view Transformer
wheat spike volume estimation
geometry-aware representation