🤖 AI Summary
Accurate spatiotemporal association of young apples across multi-day, multi-view stereo image sequences in orchards remains challenging due to their small size, low point-cloud resolution, and unstable visual appearance.
Method: We propose the first lightweight Transformer architecture specifically designed for young apple detection and tracking. It jointly encodes shape and spatial position features via alternating self-attention and cross-attention mechanisms, enabling iterative refinement. Integrated with stereo vision feature extraction and an end-to-end trainable spatiotemporal matching framework, it explicitly models inter-frame and inter-view correspondences.
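The alternating self- and cross-attention refinement described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the module name, feature dimension, layer count, and dot-product scoring head are all assumptions. Self-attention aggregates context among fruitlets within one frame/view, while cross-attention exchanges information between the two sets to surface correspondence cues.

```python
# Illustrative sketch (assumed architecture, not the paper's code):
# alternating self-/cross-attention layers that iteratively refine
# per-fruitlet features from two frames (or views) before matching.
import torch
import torch.nn as nn

class AlternatingAttentionEncoder(nn.Module):
    def __init__(self, dim=64, heads=4, layers=3):
        super().__init__()
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers))

    def forward(self, a, b):
        # a, b: (batch, num_fruitlets, dim) features for the two
        # frames/views, e.g. a shape descriptor plus a positional
        # encoding combined into one vector per fruitlet (assumed).
        for sa, ca in zip(self.self_attn, self.cross_attn):
            a = a + sa(a, a, a)[0]       # intra-frame self-attention
            b = b + sa(b, b, b)[0]
            a2 = a + ca(a, b, b)[0]      # inter-frame cross-attention
            b2 = b + ca(b, a, a)[0]
            a, b = a2, b2
        return a, b

enc = AlternatingAttentionEncoder()
feat_day1 = torch.randn(1, 5, 64)   # 5 fruitlets detected on day 1
feat_day2 = torch.randn(1, 6, 64)   # 6 fruitlets detected on day 2
a, b = enc(feat_day1, feat_day2)
# Pairwise association logits between refined features (assumed head).
scores = a @ b.transpose(1, 2)      # shape (1, 5, 6)
```

A matching (e.g. Hungarian assignment or a softmax over the score matrix) would then be read off the pairwise scores; the paper's actual matching head may differ.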
Contribution/Results: Evaluated on a real-world commercial orchard dataset, our method achieves a 92.4% F1-score, outperforming all existing baselines and ablation variants. It establishes a new paradigm for fine-grained fruit tracking in agriculture, offering robustness to occlusion, viewpoint variation, and developmental morphological changes.
📝 Abstract
In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo images collected on different days and from different camera poses. State-of-the-art association methods in agriculture focus on matching larger crops using either high-resolution point clouds or temporally stable features, both of which are difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, and propagates and refines these features through a series of transformer encoder layers with alternating self- and cross-attention. We demonstrate that our method achieves an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations.