Transformer-Based Spatio-Temporal Association of Apple Fruitlets

📅 2025-03-05
🤖 AI Summary
Problem: Associating apple fruitlets across stereo image sequences captured on different days and from different camera poses is difficult because the fruit is small, high-resolution point clouds are hard to obtain in the field, and visual appearance is not stable over time. Method: The paper proposes a transformer-based architecture that encodes the shape and position of each fruitlet and iteratively refines these features through a series of transformer encoder layers with alternating self- and cross-attention. Contribution/Results: On data collected in a commercial apple orchard, the method achieves an F1-score of 92.4% and outperforms all baselines and ablations reported in the paper.

📝 Abstract
In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo images collected on different days and from different camera poses. State-of-the-art association methods in agriculture focus on matching larger crops using either high-resolution point clouds or temporally stable features, both of which are difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, and propagates and refines these features through a series of transformer encoder layers with alternating self- and cross-attention. We demonstrate that our method achieves an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations.
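
For readers unfamiliar with the reported metric, the sketch below shows one common way an F1-score can be computed for an association task. It assumes that predictions and ground truth are both given as sets of matched index pairs and that a prediction counts as correct only when the exact pair appears in the ground truth. The function name `association_f1` and the pair representation are illustrative assumptions; the source does not state how the paper scores its matches.

```python
def association_f1(predicted, ground_truth):
    """F1-score over matched pairs (hypothetical helper, not from the paper).

    predicted, ground_truth: iterables of (index_in_set_a, index_in_set_b).
    """
    pred, gt = set(predicted), set(ground_truth)
    true_positives = len(pred & gt)  # pairs that appear in both sets

    # Precision: fraction of predicted pairs that are correct.
    precision = true_positives / len(pred) if pred else 0.0
    # Recall: fraction of ground-truth pairs that were recovered.
    recall = true_positives / len(gt) if gt else 0.0

    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 predicted pairs, 4 true pairs, 2 in common.
predicted_pairs = [(0, 0), (1, 1), (2, 3)]
true_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
score = association_f1(predicted_pairs, true_pairs)
```

In the toy example, precision is 2/3 and recall is 2/4, so the score is their harmonic mean, 4/7.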
Problem

Research questions and friction points this paper is trying to address.

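Associating apple fruitlets across stereo images taken on different days and from different camera poses.
Existing association methods in agriculture target larger crops and rely on high-resolution point clouds or temporally stable features, both of which are hard to obtain for small fruit in the field.
Encoding and refining per-fruitlet features with a transformer-based architecture so that matching remains possible under these constraints.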
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based spatio-temporal association method
Encodes shape and position of apple fruitlets
Uses alternating self- and cross-attention layers
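
The alternation of self- and cross-attention listed above can be sketched roughly as follows. This is a minimal single-head illustration under strong simplifying assumptions: there are no learned projections, feed-forward layers, or normalization, and each fruitlet is represented by a plain feature vector standing in for its encoded shape and position. The names `attend` and `refine` are hypothetical and are not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with a residual update.

    queries, context: lists of equal-length feature vectors.
    Each query is updated with a weighted mix of the context vectors.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[i] for w, k in zip(weights, context))
                 for i in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])  # residual
    return updated


def refine(feats_a, feats_b, num_layers=2):
    """Alternate self-attention (within a set) and cross-attention (across sets)."""
    for _ in range(num_layers):
        # Self-attention: each fruitlet attends to fruitlets in its own image.
        feats_a, feats_b = attend(feats_a, feats_a), attend(feats_b, feats_b)
        # Cross-attention: each fruitlet attends to fruitlets in the other image.
        feats_a, feats_b = attend(feats_a, feats_b), attend(feats_b, feats_a)
    return feats_a, feats_b


# Toy descriptors: three fruitlets in image A, two in image B (values are made up).
set_a = [[0.1, 0.2, 0.0], [0.4, 0.1, 0.3], [0.0, 0.5, 0.2]]
set_b = [[0.1, 0.2, 0.1], [0.3, 0.1, 0.3]]
refined_a, refined_b = refine(set_a, set_b)
```

The two sets may contain different numbers of fruitlets; attention preserves the number of vectors in each set and their dimensionality, so the refined features could then be compared pairwise by whatever matching step follows.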