FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous Ensemble in Fine-Grained Fruit Recognition

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges of fine-grained fruit recognition, the scarcity of high-quality labeled data and the high visual similarity among categories, by introducing a dataset of 306 fruit classes and 116,233 samples. The authors propose FruitEnsemble, a two-stage dynamic inference framework. In the first stage, a validation-calibrated weighted ensemble of heterogeneous backbones generates a Top-3 candidate set. In the second stage, samples whose ensemble confidence falls below 0.6 are passed to a multimodal large language model (MLLM), which arbitrates among the candidates using chain-of-thought reasoning over external botanical descriptions. Training additionally uses a hard-sample-aware joint loss. On the newly curated dataset, the method reaches a classification accuracy of 70.49%, which the authors report as exceeding existing state-of-the-art models, and it targets real-world agricultural visual sorting and quality inspection.

📝 Abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first construct a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimize the training pipeline with a hard-sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
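
The two-stage routing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the Top-3 pool and the 0.6 confidence threshold come from the abstract. The function names, the rule that weights are proportional to validation accuracy, and the `arbiter` callable (a hypothetical stand-in for the MLLM with its CoT prompt and botanical descriptions) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

CONF_THRESHOLD = 0.6  # from the abstract: arbitration triggers below this confidence


def calibrate_weights(val_accuracies: Sequence[float]) -> List[float]:
    """Assumed scheme: weight each backbone in proportion to its validation accuracy."""
    total = sum(val_accuracies)
    return [acc / total for acc in val_accuracies]


def ensemble_top_k(
    probs_per_model: Sequence[Sequence[float]],
    weights: Sequence[float],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Fuse per-model class probabilities by weighted sum and return the top-k classes."""
    n_classes = len(probs_per_model[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, probs_per_model))
        for c in range(n_classes)
    ]
    ranked = sorted(range(n_classes), key=lambda c: fused[c], reverse=True)
    return [(c, fused[c]) for c in ranked[:k]]


def predict(
    probs_per_model: Sequence[Sequence[float]],
    weights: Sequence[float],
    arbiter: Callable[[List[int]], int],
    k: int = 3,
    threshold: float = CONF_THRESHOLD,
) -> Tuple[int, bool]:
    """Return (predicted class, whether the arbiter was consulted)."""
    candidates = ensemble_top_k(probs_per_model, weights, k)
    top_class, top_conf = candidates[0]
    if top_conf >= threshold:
        # Stage 1 is confident enough: accept the ensemble's top candidate.
        return top_class, False
    # Stage 2: hand the candidate pool to the (hypothetical) MLLM arbiter.
    return arbiter([c for c, _ in candidates]), True
```

In a real pipeline the arbiter would also receive the image and the candidate classes' botanical descriptions; here it only sees class indices so the routing logic stays self-contained. Because the arbiter runs only for low-confidence samples, the expensive MLLM call is skipped whenever the ensemble is already confident.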

Problem

Research questions and friction points this paper is trying to address.

fine-grained fruit recognition

data scarcity

visual similarity

heterogeneous ensemble

agricultural computer vision

Innovation

Methods, ideas, or system contributions that make the work stand out.

FruitEnsemble

multimodal large language model

heterogeneous ensemble