🤖 AI Summary
Traditional image classification relies heavily on large-scale annotated datasets and parameter-intensive fine-tuning, while individual vision-language models (VLMs) often fail to capture the multidimensional semantic structure of visual content. To address these limitations, this paper proposes a multi-agent collaborative reasoning framework that decomposes image classification into three coordinated stages: global theme modeling (performed by an Outliner Agent), multi-dimensional fine-grained description generation (handled by Aspect Agents), and integrative reflective decision-making (executed by a Reasoning Agent). The framework replaces end-to-end training with prompt-driven, stepwise reasoning, significantly reducing dependence on extensive labeled data and model fine-tuning while enhancing robustness and interpretability. Evaluated on four standard benchmarks, it consistently outperforms state-of-the-art VLM baselines, demonstrating clear advantages in both classification accuracy and reasoning transparency. This work establishes multi-agent collaboration as a promising paradigm for interpretable, data-efficient visual understanding.
📝 Abstract
Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision-language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi-Agent based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on four diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.
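The three-stage pipeline described above can be sketched as plain Python. This is a minimal, hypothetical illustration, not the authors' implementation: `query_vlm`, the specific aspect prompts, and the label-selection step are all assumptions standing in for whatever VLM backend and prompt templates MARIC actually uses.

```python
# Hypothetical sketch of a MARIC-style three-stage pipeline.
# `query_vlm` is a stub standing in for any real image+text VLM call.

def query_vlm(image: str, prompt: str) -> str:
    # Placeholder for a real vision-language model API call; it simply
    # echoes the prompt so the whole pipeline is runnable end to end.
    return f"[{image}] response to: {prompt}"

def outliner_agent(image: str):
    # Stage 1: model the global theme and emit targeted aspect prompts.
    theme = query_vlm(image, "Summarize the global theme of this image.")
    aspect_prompts = [  # three example dimensions (assumed, not from the paper)
        "Describe the main objects and their attributes.",
        "Describe the scene layout and background context.",
        "Describe colors, textures, and fine-grained details.",
    ]
    return theme, aspect_prompts

def aspect_agents(image: str, aspect_prompts: list[str]) -> list[str]:
    # Stage 2: one fine-grained description per visual dimension.
    return [query_vlm(image, p) for p in aspect_prompts]

def reasoning_agent(image: str, theme: str, descriptions: list[str],
                    labels: list[str]) -> str:
    # Stage 3: reflective synthesis of the complementary outputs.
    context = theme + " | " + " | ".join(descriptions)
    query_vlm(image, f"Given: {context}. Choose one label from {labels}.")
    # A real implementation would parse the VLM's answer; the stub just
    # picks the first candidate label as a placeholder decision.
    return labels[0]

def classify(image: str, labels: list[str]) -> str:
    theme, prompts = outliner_agent(image)
    descriptions = aspect_agents(image, prompts)
    return reasoning_agent(image, theme, descriptions, labels)
```

Swapping `query_vlm` for an actual multimodal model call (and parsing its final answer) is all that separates this skeleton from a working prompt-driven classifier; no gradient updates or fine-tuning are involved at any stage.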