🤖 AI Summary
This work proposes a hierarchical, vision-language framework based on CLIP embeddings to address prediction instability in multi-view plant phenotyping caused by view redundancy and appearance variability. By introducing CLIP into multi-task plant phenotypic regression for the first time, the method leverages lightweight textual priors to conditionally model view information and aggregates multi-view images into a view-invariant representation, ensuring robustness even when input views are missing or unordered. The model jointly predicts plant age and leaf count within a unified architecture, achieving significant improvements on the GroMo25 benchmark: mean absolute error (MAE) is reduced by 49.5% for age estimation and by 44.2% for leaf count prediction, substantially outperforming existing baselines.
📝 Abstract
Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision-language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level, yielding stable predictions under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: https://github.com/SimonWarmers/CLIP-MVP
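To make the aggregation idea concrete, here is a minimal NumPy sketch of how pooling per-view embeddings yields an order-invariant representation that can be fused with a text-prior embedding. This is an illustration under assumptions, not the repository's implementation: the pooling choice (mean), the fusion choice (concatenation), and the toy 8-dimensional embeddings are stand-ins (real CLIP embeddings are 512+ dimensional and would come from a CLIP image/text encoder).

```python
import numpy as np

def aggregate_views(view_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool per-view image embeddings (shape: [num_views, dim])
    into a single plant representation. Mean pooling is permutation-
    invariant, so the result does not depend on view order."""
    return view_embeddings.mean(axis=0)

def condition_on_prior(visual: np.ndarray, text_prior: np.ndarray) -> np.ndarray:
    """Fuse the pooled visual embedding with a lightweight text-prior
    embedding. Concatenation is used here as a simple stand-in for
    the conditioning mechanism described in the abstract."""
    return np.concatenate([visual, text_prior])

# Toy data: 4 rotational views with 8-dim embeddings (illustrative sizes).
rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))
prior = rng.normal(size=8)

rep = condition_on_prior(aggregate_views(views), prior)

# Shuffling the views leaves the representation unchanged,
# which is the "unordered inputs" robustness the abstract claims.
rep_shuffled = condition_on_prior(aggregate_views(views[::-1]), prior)
assert np.allclose(rep, rep_shuffled)
```

Dropping a view changes the mean only slightly, which is one way pooling also degrades gracefully when views are missing rather than failing outright.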