🤖 AI Summary
To address the insufficient trustworthiness of multimodal multi-agent systems in zero-shot scenarios, this paper proposes a trust-aware modular visual classification architecture. It decouples perception (vision agents supported by CLIP-based image retrieval) from meta-reasoning (a RAG-augmented, non-visual language-model orchestrator), and introduces a dynamic trust calibration mechanism coupled with an iterative re-evaluation loop. Agent confidence is quantified and regulated with calibration metrics including Expected Calibration Error (ECE) and Concordance Correlation Coefficient (CCC), which helps mitigate agent overconfidence. Evaluated on a zero-shot apple leaf disease diagnosis task, the system achieves 85.63% accuracy, a 77.94% accuracy improvement in the zero-shot setting. GPT-4o shows better calibration than the other agents, while image-based RAG makes predictions more reliable by grounding them in visually similar cases. All code and experimental configurations are openly released to support reproducibility.
📝 Abstract
Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet a pressing challenge remains: how can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions in visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components, including the complete software source code, are openly released to support reproducibility, transparency, and community benchmarking on GitHub: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
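To make the calibration idea concrete, the following is a minimal sketch of how Expected Calibration Error could be computed from an agent's self-reported confidences and the correctness of its predictions. It uses the standard equal-width binning definition of ECE; the function name, the bin count of 10, and the sample values are illustrative assumptions and are not taken from the released code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; the paper's bin count is not stated here
) -> float:
    """Standard equal-width-bin ECE: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    totals = [0] * n_bins       # number of predictions per bin
    conf_sums = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins         # number of correct predictions per bin

    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        totals[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    n = len(confidences)
    ece = 0.0
    for total, conf_sum, hit in zip(totals, conf_sums, hits):
        if total:  # empty bins contribute nothing
            accuracy = hit / total
            mean_conf = conf_sum / total
            ece += (total / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical agent that reports 0.95 confidence but is right 3 times out of 4.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, True, False]
)
```

A lower value means reported confidence tracks actual accuracy more closely. An orchestrator could, under these assumptions, down-weight agents whose ECE is high; how the framework actually converts calibration metrics into trust is defined in the released repository.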