Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot recognition of multiple plant species in vegetation quadrat images. We propose a patch-wise Vision Transformer (ViT) inference framework that requires no additional training: a fine-tuned checkpoint (ViTD2PC24All) is applied tile by tile using a 4×4 tiling of each quadrat image. Domain knowledge is injected through unsupervised visual clustering—PaCMAP dimensionality reduction followed by K-Means—and geolocation-based filtering, and a cluster-specific Bayesian weighting strategy fuses the tile-level predictions. Our key contribution is the explicit integration of unsupervised visual structure and geographic priors into the zero-shot inference pipeline, eliminating the need for fine-tuning or any other training. On the PlantCLEF 2025 private leaderboard, the approach achieves a macro-F1 score of 0.348, ranking second. All code and configuration files are publicly released.
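The tile-then-vote pipeline sketched below illustrates the core inference step. The grid split and the vote threshold (`min_votes`) are illustrative assumptions rather than details taken from the paper, and the resize of each tile to the ViT's 518×518 input resolution is omitted.

```python
from collections import Counter
import numpy as np

def tile_image(img: np.ndarray, grid: int = 4) -> list:
    """Split an H x W x C quadrat image into a grid x grid set of tiles.
    In the full pipeline each tile would then be resized to the ViT's
    518 x 518 input resolution (resize omitted in this sketch)."""
    h, w = img.shape[:2]
    th, tw = h // grid, w // grid
    return [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(grid) for c in range(grid)]

def majority_vote(tile_preds: list, min_votes: int = 2) -> list:
    """Aggregate per-tile species predictions: keep every species that
    appears in at least `min_votes` of the tiles (assumed threshold)."""
    counts = Counter(s for preds in tile_preds for s in set(preds))
    return sorted(s for s, n in counts.items() if n >= min_votes)
```

With a 4×4 grid, a threshold like this treats species predicted in only a single tile as likely noise while retaining species that recur across the quadrat.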

📝 Abstract
We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer (ViTD2PC24All) for patch-level inference, (ii) a 4×4 tiling strategy that aligns tile size with the network's 518×518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
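The Bayesian re-weighting and geolocation filtering described in the abstract can both be viewed as multiplying the model's species probabilities by a prior and renormalising. A minimal numpy sketch, assuming a flat probability vector per image and a per-cluster prior vector (both hypothetical shapes, not taken from the released code):

```python
import numpy as np

def reweight(probs: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Posterior ∝ likelihood × prior, renormalised to sum to 1.
    `prior` may be a soft cluster-specific species prior or a hard
    0/1 geolocation mask; falls back to `probs` if everything is zeroed."""
    post = probs * prior
    total = post.sum()
    return post / total if total > 0 else probs
```

Under this view, the geolocation filter is the special case where `prior` is a binary mask over species known to occur near the quadrat's coordinates.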
Problem

Research questions and friction points this paper is trying to address.

Zero-shot identification of multiple plant species in vegetation quadrat images
Tile-based ViT inference enriched with visual-cluster priors
Adapting to quadrat imagery without any additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Vision Transformer (ViTD2PC24All) for patch-level inference
4×4 tiling strategy matching the network's 518×518 receptive field
PaCMAP + K-Means visual clustering and geolocation filtering for domain-prior adaptation
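The visual-clustering step in the list above can be sketched end to end. For self-containment this sketch substitutes an SVD projection for PaCMAP and a minimal K-Means for a library implementation; both substitutions are assumptions, and in practice the feature vectors would come from the ViT backbone rather than raw data.

```python
import numpy as np

def reduce_dim(feats: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Linear projection via SVD (a stand-in for PaCMAP, which the
    paper uses for dimensionality reduction)."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Minimal K-Means: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Each resulting cluster would then carry its own Bayesian species prior, applied when re-weighting tile predictions for images assigned to that cluster.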