DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how well vision foundation models handle the perception tasks required by blueberry harvesting robots: fruit segmentation, bruise segmentation, and detection of individual fruits and fruit clusters. The self-supervised DINOv3 model serves as a frozen backbone (no fine-tuning) coupled with a lightweight unified decoder. DINOv3 performs strongly on the segmentation tasks, and performance improves consistently with model scale. Detection is weaker: it is hindered by variation in object scale and by spatial discretization, and fruit cluster detection in particular exposes limitations in modeling targets defined by spatial aggregation. The work highlights both the potential and the constraints of general-purpose vision foundation models in agricultural contexts and offers empirical guidance for the design of future agricultural vision systems.

📝 Abstract

Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for visual tasks related to robotic blueberry harvesting, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for robotic blueberry harvesting. Code and dataset will be available upon acceptance.
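
To make the "patch discretization" constraint concrete, the sketch below snaps a bounding box outward to a patch grid and measures how much overlap (IoU) with the true box is lost. It is an illustration only, not the paper's evaluation code: the 16-pixel patch size, the box coordinates, and the helper names (`patch_grid`, `snap_box_to_grid`, `iou`) are all assumptions introduced here, and a real detection head can regress sub-patch offsets, so the numbers describe the coarse-grid case rather than a hard limit.

```python
import math

# Assumed ViT patch size in pixels; the abstract does not state it.
PATCH = 16


def patch_grid(image_size: int, patch: int = PATCH) -> int:
    """Number of patch tokens along one side of a square image."""
    return image_size // patch


def snap_box_to_grid(box, patch: int = PATCH):
    """Expand an (x0, y0, x1, y1) box outward to the nearest patch boundaries."""
    x0, y0, x1, y1 = box
    return (
        math.floor(x0 / patch) * patch,
        math.floor(y0 / patch) * patch,
        math.ceil(x1 / patch) * patch,
        math.ceil(y1 / patch) * patch,
    )


def iou(a, b) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0


# Hypothetical targets: a small single fruit and a much larger region.
small_fruit = (21, 21, 43, 43)      # 22 x 22 px
large_region = (21, 21, 213, 213)   # 192 x 192 px

for name, box in [("small fruit", small_fruit), ("large region", large_region)]:
    snapped = snap_box_to_grid(box)
    print(f"{name}: snapped to {snapped}, IoU with true box = {iou(box, snapped):.3f}")
```

With these assumed boxes, the same one-patch rounding costs the small target far more IoU than the large one, which is one way to read the abstract's point that detection is sensitive to target scale while per-patch segmentation is less so.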

Problem

Research questions and friction points this paper is trying to address.

blueberry harvesting

visual perception

fruit detection

bruise segmentation

vision foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

DINOv3

foundation models

blueberry harvesting