DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting

📅 2026-03-02
🤖 AI Summary
This study examines how well vision foundation models serve the perception tasks of blueberry harvesting robots: fruit and bruise segmentation, and detection of individual fruits and fruit clusters. The self-supervised DINOv3 model is used as a frozen backbone, with no fine-tuning, coupled with a lightweight unified decoder. DINOv3 performs strongly on segmentation, and results improve consistently with model scale. Detection, however, is limited by variation in object scale and by patch discretization, and fruit cluster detection in particular exposes limits in modeling targets defined by spatial aggregation. The work shows both the potential and the constraints of general-purpose vision foundation models in agricultural settings and offers empirical guidance for designing future agricultural vision systems.

📝 Abstract
Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.
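The abstract's point about patch discretization and target scale can be illustrated with a small arithmetic sketch. It is not the authors' code: the 16-pixel patch size, the box coordinates, and all function names are assumptions chosen for illustration. The sketch snaps a box outward to the patch grid, as a predictor limited to whole patch tokens would have to, and compares the resulting IoU for a small fruit-sized box and a larger one.

```python
import math

PATCH = 16  # assumed ViT patch size in pixels (not stated in the abstract)


def patch_grid(height, width, patch=PATCH):
    """Number of patch tokens along each axis of an image."""
    return height // patch, width // patch


def snap_to_grid(box, patch=PATCH):
    """Expand a box (x0, y0, x1, y1) outward to the nearest patch boundaries."""
    x0, y0, x1, y1 = box
    return (
        math.floor(x0 / patch) * patch,
        math.floor(y0 / patch) * patch,
        math.ceil(x1 / patch) * patch,
        math.ceil(y1 / patch) * patch,
    )


def patches_covered(box, patch=PATCH):
    """Count the patch cells that a box overlaps."""
    x0, y0, x1, y1 = snap_to_grid(box, patch)
    return ((x1 - x0) // patch) * ((y1 - y0) // patch)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical boxes: a 20 px fruit and a 200 px region, in pixel coordinates.
small_fruit = (20, 20, 40, 40)
large_region = (100, 100, 300, 300)

small_iou = iou(small_fruit, snap_to_grid(small_fruit))
large_iou = iou(large_region, snap_to_grid(large_region))

print(patch_grid(224, 224))          # token grid for a 224 x 224 input
print(patches_covered(small_fruit), round(small_iou, 3))
print(patches_covered(large_region), round(large_iou, 3))
```

Under these assumptions the small box overlaps only four patch cells and loses most of its IoU when snapped to the grid, while the large box is barely affected. This is one plausible reading of why small targets would need finer spatial modeling on top of patch-level features; the paper itself may attribute the effect differently.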
Problem

Research questions and friction points this paper is trying to address.

blueberry harvesting
visual perception
fruit detection
bruise segmentation
vision foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

DINOv3
foundation models
blueberry harvesting
self-supervised learning
visual perception
Rui-Feng Wang
Bio-Sensing, Automation, and Intelligence Laboratory, Department of Agricultural and Biological Engineering, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
Daniel Petti
Bio-Sensing, Automation, and Intelligence Laboratory, Department of Agricultural and Biological Engineering, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
Yue Chen
Associate Professor, Georgia Institute of Technology and Emory University
Medical Robotics, Soft Robotics, Continuum Robots
Changying Li
Professor of Agricultural & Biological Engineering, University of Florida; Adjunct Professor at UGA
sensing, agricultural robotics, machine learning, phenomics, precision agriculture