3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Accurate estimation of wheat spike volume in the field is difficult: active 3D sensing is sensitive to plant motion or poorly suited to outdoor conditions, 3D reconstruction is computationally expensive, and purely 2D models lack explicit geometric information. This work proposes a hybrid approach that combines a pose-robust (rigid-invariant) 3D point cloud network with a multi-view image Transformer in an ensemble, then transfers the ensemble's geometric knowledge to an image-only student model through feature-based or label-based knowledge distillation. Relative to the non-distilled image model (MAE 654.31 mm³, correlation 0.76), the two distilled variants reach an MAE of 639.93 mm³ with a correlation of 0.77 and an MAE of 644.62 mm³ with a correlation of 0.82, respectively. Inference time per spike drops from 160 ms to 1.4 ms, and distillation also mitigates volume-dependent bias.
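The difference between label-level and feature-level distillation for a regression target such as spike volume can be illustrated with a minimal sketch. The exact loss used in the paper is not given here, so the squared-error terms, the function names, and the weights `alpha` and `beta` below are illustrative assumptions rather than the authors' formulation.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long, non-empty sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def label_distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Label-level distillation (assumed form): the student matches both the
    ground-truth volumes and the volumes predicted by the teacher."""
    task = mse(student_pred, target)
    distill = mse(student_pred, teacher_pred)
    return (1.0 - alpha) * task + alpha * distill


def feature_distillation_loss(student_pred, target,
                              student_feat, teacher_feat, beta=0.5):
    """Feature-level distillation (assumed form): the student matches the
    ground-truth volumes while its latent features imitate the teacher's."""
    task = mse(student_pred, target)
    distill = mse(student_feat, teacher_feat)
    return (1.0 - beta) * task + beta * distill
```

In both cases the teacher is only needed while training, which is why inference can remain image-only.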

📝 Abstract

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach that uses knowledge distillation during training while retaining efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ for the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase the correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
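To see why distance-based histogram features are rigid-invariant, consider the following minimal sketch. It assumes the histogram is built from pairwise point-to-point distances; the abstract does not specify which distances the authors use, so the function name `distance_histogram` and its parameters `n_bins` and `max_dist` are hypothetical.

```python
import math
from itertools import combinations


def distance_histogram(points, n_bins=8, max_dist=None):
    """Normalized histogram of pairwise Euclidean distances in a point cloud.

    Rotations and translations preserve pairwise distances, so the feature
    does not depend on the pose of the cloud.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    if not dists:
        raise ValueError("need at least two points")
    # Upper histogram edge: a caller-supplied value or the largest distance.
    upper = max_dist if max_dist is not None else max(dists)
    hist = [0] * n_bins
    for d in dists:
        # Distances at or beyond the upper edge fall into the last bin.
        idx = min(int(d / upper * n_bins), n_bins - 1) if upper > 0 else 0
        hist[idx] += 1
    return [count / len(dists) for count in hist]
```

Applying any rotation and translation to `points` leaves the returned histogram unchanged, up to floating-point effects for distances that lie very close to a bin edge. A network that consumes such features therefore does not need to learn pose invariance from data.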

Problem

Research questions and friction points this paper is trying to address.

spike volume estimation

3D reconstruction

knowledge distillation

multi-view image models

field phenotyping

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation

3D reconstruction

multi-view Transformer