🤖 AI Summary
This study investigates how well vision foundation models transfer to perception tasks for blueberry harvesting robots, focusing on fruit and bruise segmentation and on detection of individual fruits and fruit clusters. The self-supervised, pre-trained DINOv3 model is used as a frozen backbone and paired with a lightweight unified decoder, so the backbone itself is never fine-tuned. Results show that DINOv3 performs strongly on the segmentation tasks, with accuracy improving consistently as the backbone scales up. Detection is less effective: it is hindered by variation in object scale and by spatial discretization, and fruit cluster detection in particular exposes a limitation in modeling spatial aggregation relationships. The work highlights both the potential and the constraints of general-purpose vision foundation models in agricultural contexts and offers empirical guidance for the design of future agricultural vision systems.
📝 Abstract
Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for visual tasks in robotic blueberry harvesting, including fruit and bruise segmentation as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and improves with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, which provides guidance for robotic blueberry harvesting. Code and dataset will be made available upon acceptance.
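The patch discretization constraint mentioned above can be illustrated with a small calculation. This is a minimal sketch, not the authors' analysis: the patch size of 16 pixels, the helper names `patch_footprint` and `snapped_iou`, and the example boxes are all assumptions introduced here for illustration. It measures how much of a box's extent is lost when the box can only be expressed on the patch grid.

```python
import math

PATCH = 16  # assumed ViT patch size in pixels; the abstract does not state it


def patch_footprint(box, patch=PATCH):
    """Return (cols, rows) of patch cells overlapped by a pixel box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cols = math.ceil(x1 / patch) - math.floor(x0 / patch)
    rows = math.ceil(y1 / patch) - math.floor(y0 / patch)
    return cols, rows


def snapped_iou(box, patch=PATCH):
    """IoU between a box and the smallest patch-aligned box that contains it."""
    x0, y0, x1, y1 = box
    cols, rows = patch_footprint(box, patch)
    true_area = (x1 - x0) * (y1 - y0)
    snapped_area = cols * rows * patch * patch
    # The true box lies entirely inside the snapped box, so IoU is an area ratio.
    return true_area / snapped_area


if __name__ == "__main__":
    # Hypothetical boxes: two placements of a 12 px fruit and one 160 px cluster.
    examples = {
        "small fruit inside one patch": (20, 20, 32, 32),
        "small fruit straddling patches": (10, 10, 22, 22),
        "large cluster": (5, 5, 165, 165),
    }
    for name, box in examples.items():
        print(name, patch_footprint(box), round(snapped_iou(box), 3))
```

Under these assumptions, the same 12 px fruit yields a patch-aligned IoU of about 0.56 when it falls inside one patch and about 0.14 when it straddles four, while the 160 px box stays above 0.8. A decoder that localizes at patch resolution would therefore need sub-patch refinement for small fruits, which is one plausible reading of why detection is more sensitive to target scale than segmentation.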