🤖 AI Summary
Traditional single-species plant counting models suffer from poor generalization due to the continuous emergence of new plant species and highly variable imaging conditions—including diverse scenes, scales, and occlusions.
Method: We propose the first vision foundation model for universal plant counting—capable of cross-scene, cross-scale, and cross-species generalization. Inspired by class-agnostic counting, we introduce a multi-branch box-aware local counting module that jointly performs local density estimation and exemplar feature extraction and matching to robustly model plants' dynamic, non-rigid structures. The model is built upon a plain vision Transformer architecture and trained on two newly constructed datasets: PAC-105 and PAC-Somalia.
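The extract-and-match step at the heart of class-agnostic counting can be illustrated with a minimal NumPy sketch (all names here are hypothetical; the paper's actual module operates on vision-transformer features and adds box-aware local counters). It correlates an exemplar box's feature vector with every spatial location of an image feature map via cosine similarity, producing a similarity map from which local counts can be regressed:

```python
import numpy as np

def extract_and_match(feat_map, exemplar_feat, eps=1e-8):
    """Cosine-similarity matching of one exemplar feature against a C x H x W feature map.

    Returns an H x W similarity map: high values mark locations
    that resemble the exemplar (e.g., a user-annotated plant box).
    """
    C, H, W = feat_map.shape
    f = feat_map.reshape(C, -1)                                  # C x (H*W)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + eps)     # L2-normalize columns
    e = exemplar_feat / (np.linalg.norm(exemplar_feat) + eps)    # L2-normalize exemplar
    return (e @ f).reshape(H, W)                                 # cosine similarity map

# Toy usage: a 4-channel 3x3 feature map where only the center matches the exemplar.
feat = np.zeros((4, 3, 3))
feat[:, 1, 1] = np.array([1.0, 0.0, 0.0, 0.0])
sim = extract_and_match(feat, np.array([1.0, 0.0, 0.0, 0.0]))
```

In a full CAC pipeline, this similarity map (often stacked over several exemplars and scales) is what the downstream counting head consumes instead of raw class labels, which is what makes the counter class-agnostic.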
Results: Extensive experiments demonstrate significant improvements over state-of-the-art class-agnostic counting methods across multiple challenging benchmarks, achieving higher accuracy (18.3% lower MAE), strong robustness to scale and scene variations, and efficient inference—establishing a scalable foundation model paradigm for plant biodiversity monitoring.
📝 Abstract
Accurate plant counting provides valuable information for agriculture, such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or regression model to count a specific plant species. However, plant species are highly diverse, and new cultivars are bred each year; it is practically impossible to build a species-specific counting model for every one. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most everyday objects, which are spatially and temporally invariant, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse counting performance than rigid instances such as heads and cars, so current CAC and open-world detection models are suboptimal for counting plants. In this work, we follow the line of the TasselNet plant counting models and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm of CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are collected. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
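The local counting idea that TasselNetV4 inherits, predicting a count for each local image patch and then merging overlapping patch counts into a global count, can be sketched in a few lines of NumPy (names are hypothetical; the real model regresses local counts from transformer features rather than summing a given density map, and its multi-branch counters handle several window sizes at once):

```python
import numpy as np

def local_counts(density, win, stride):
    """Count within each sliding window: the quantity a local counter would regress."""
    H, W = density.shape
    return [density[i:i + win, j:j + win].sum()
            for i in range(0, H - win + 1, stride)
            for j in range(0, W - win + 1, stride)]

def merge_local_counts(counts, shape, win, stride):
    """Redistribute each local count uniformly over its window, average where
    windows overlap, and sum to recover the image-level count."""
    H, W = shape
    acc = np.zeros(shape)      # redistributed count mass
    cover = np.zeros(shape)    # how many windows cover each pixel
    it = iter(counts)
    for i in range(0, H - win + 1, stride):
        for j in range(0, W - win + 1, stride):
            acc[i:i + win, j:j + win] += next(it) / (win * win)
            cover[i:i + win, j:j + win] += 1
    return (acc / np.maximum(cover, 1)).sum()

# Toy usage: a uniform density map whose total count is exactly recovered
# when the windows tile the image.
density = np.full((8, 8), 0.1)
total = merge_local_counts(local_counts(density, 4, 4), density.shape, 4, 4)
```

Counting locally rather than per-pixel is what tolerates the loose, non-rigid appearance of plants: the counter only has to get the count right within each window, not localize every leaf or tassel.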