Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

📅 2025-08-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenging problem of cross-image, multi-task vision-language compositional reasoning, spanning classification, visual question answering, and generation. We propose a training-free, modular dual-agent collaboration framework comprising a VisionReasoner agent and a context-aware PromptEngineer. The framework enables end-to-end inference via zero-shot prompting, multi-round input optimization, and cross-modal alignment. Its core innovation is decoupling task logic from model capabilities, enabling plug-and-play generalization across 18 heterogeneous datasets. Evaluated on TQA, DocVQA, and MMCoQA, it achieves 99.13% accuracy, 96.87% accuracy, and 75.28 ROUGE-L, respectively, approaching human-level performance. The method significantly improves both generality and robustness on complex, multi-image vision-language tasks.
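A minimal sketch of the dual-agent collaboration described above: one agent builds a context-aware prompt from task metadata, the other (an LVLM) runs the final inference. All names here (Task, PromptEngineer, VisionReasoner, call_lvlm, the model identifier) are illustrative assumptions, not the authors' released interface.

```python
# Hypothetical sketch of the dual-agent pipeline; not the authors' actual API.
from dataclasses import dataclass


def call_lvlm(model: str, prompt: str, image_paths: list[str]) -> str:
    """Stub for an LVLM client call; swap in a real API in practice."""
    return f"[{model}] answer for: {prompt.splitlines()[-1]}"


@dataclass
class Task:
    name: str              # e.g. "DocVQA"
    task_type: str         # "classification" | "qa" | "generation"
    question: str
    image_paths: list[str]


class PromptEngineer:
    """Language agent: turns task metadata into a context-aware prompt."""

    def build_prompt(self, task: Task) -> str:
        return "\n".join([
            f"You are solving a {task.task_type} task from {task.name}.",
            f"You are given {len(task.image_paths)} image(s).",
            "Reason across all images before answering.",
            f"Question: {task.question}",
        ])


class VisionReasoner:
    """LVLM agent: performs the final multi-image inference."""

    def __init__(self, model: str = "claude-3.7"):  # assumed identifier
        self.model = model

    def answer(self, prompt: str, image_paths: list[str]) -> str:
        return call_lvlm(self.model, prompt, image_paths)


def run_pipeline(task: Task) -> str:
    # Decoupling in action: the prompt logic knows nothing about the model,
    # and the model wrapper knows nothing about the task format.
    prompt = PromptEngineer().build_prompt(task)
    return VisionReasoner().answer(prompt, task.image_paths)
```

Because the two agents only share a prompt string, either side can be swapped (a different LVLM, a different prompt strategy) without retraining, which is the plug-and-play property the summary claims.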

📝 Abstract
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.
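To make the ablated design choices concrete, the sketch below shows how "shot count" and "input length" typically enter prompt construction. The exemplar format and the MAX_CHARS budget are assumptions for illustration, not values from the paper.

```python
# Illustrative sketch of two ablated design choices: shot count and input
# length. Exemplar format and MAX_CHARS are assumed, not from the paper.
MAX_CHARS = 4000  # assumed input-length budget


def build_few_shot_prompt(question: str,
                          exemplars: list[tuple[str, str]],
                          shot_count: int) -> str:
    """Prepend up to `shot_count` (question, answer) exemplars, then
    truncate the full prompt to the input-length budget."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars[:shot_count])
    body = f"Q: {question}\nA:"
    prompt = f"{demos}\n{body}" if demos else body
    return prompt[:MAX_CHARS]


# shot_count=0 yields the zero-shot prompt; shot_count>0 adds demonstrations.
exemplars = [("What is shown in image 1?", "A bar chart."),
             ("How many pages does the document have?", "Three.")]
print(build_few_shot_prompt("What is the document's title?", exemplars, 2))
```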
Problem

Research questions and friction points this paper is trying to address.

Enables multi-image vision-language reasoning across diverse tasks
Automates interleaved multimodal reasoning with dual-agent collaboration
Evaluates framework performance on 18 diverse visual reasoning datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-agent system for multimodal reasoning
Automated modular training-free framework
Context-aware task-specific prompt generation
Authors
Angelos Vlachos
Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens, Greece
Giorgos Filandrianos
Postdoctoral researcher
Explainable AI, NLP
Maria Lymperaiou
National Technical University of Athens
Deep Learning, Natural Language Processing, Explainability, Multimodal Learning
Nikolaos Spanos
PhD Student, National Technical University of Athens
Computer Vision, Generative AI, Domain Generalization
Ilias Mitsouras
Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens, Greece
Vasileios Karampinis
Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens, Greece
Athanasios Voulodimos
Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens, Greece