Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

📅 2025-08-01

🤖 AI Summary
This work addresses cross-image, multi-task vision-language compositional reasoning spanning classification, visual question answering, and generation. We propose a training-free, modular dual-agent framework that pairs a context-aware PromptEngineer with a VisionReasoner. The framework runs end-to-end inference via zero-shot prompting, multi-round input optimization, and cross-modal alignment. Its core innovation is decoupling task logic from model capabilities, which allows plug-and-play generalization across 18 heterogeneous datasets. On TQA, DocVQA, and MMCoQA, Claude 3.7 reaches 99.13% accuracy, 96.87% accuracy, and 75.28 ROUGE-L, respectively, which is near-ceiling performance on these tasks. The results indicate that informative prompts let LVLMs reason effectively over multiple images.


📝 Abstract
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.
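The MMCoQA result above is reported in ROUGE-L, which scores a generated answer by its longest common subsequence (LCS) with a reference answer. The following is a minimal sketch of that metric, assuming whitespace tokenization, lowercasing, and the balanced F1 variant; the challenge's actual scorer configuration is not stated in the abstract, and the function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall.

    Assumption: balanced F1 over lowercased whitespace tokens; official
    scorers may tokenize differently or weight recall more heavily.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The function returns a value between 0 and 1; the 75.28 figure quoted above appears to be on a 0 to 100 scale.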

Problem

Research questions and friction points this paper is trying to address.

Enabling multi-image vision-language reasoning across diverse tasks
Automating interleaved multimodal reasoning through dual-agent collaboration
Evaluating the framework on 18 diverse visual reasoning datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-agent system for multimodal reasoning
Automated, modular, training-free framework
Context-aware, task-specific prompt generation
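
The dual-agent flow described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the `Task` fields, the `build_prompt` and `answer` functions, and the two callable interfaces are hypothetical, and the stub models stand in for real LLM and LVLM API calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces: any text-only LLM and any LVLM client can be
# wrapped to match these signatures.
TextLLM = Callable[[str], str]
VisionLLM = Callable[[str, List[str]], str]


@dataclass
class Task:
    dataset: str                      # e.g. "DocVQA"
    task_type: str                    # "classification", "qa", or "generation"
    question: str
    image_paths: List[str] = field(default_factory=list)


def build_prompt(task: Task, text_llm: TextLLM) -> str:
    """PromptEngineer role: ask a text-only model for a task-specific prompt."""
    request = (
        "Write an instruction prompt for a vision-language model.\n"
        f"Dataset: {task.dataset}\n"
        f"Task type: {task.task_type}\n"
        f"Number of images: {len(task.image_paths)}\n"
        f"Question: {task.question}\n"
    )
    return text_llm(request)


def answer(task: Task, text_llm: TextLLM, lvlm: VisionLLM) -> str:
    """VisionReasoner role: run final inference on the prompt and all images."""
    prompt = build_prompt(task, text_llm)
    return lvlm(prompt, task.image_paths).strip()


# Stub models so the sketch runs without any external service.
def echo_text_llm(request: str) -> str:
    return "PROMPT<<" + request + ">>"


def counting_lvlm(prompt: str, image_paths: List[str]) -> str:
    return f"  saw {len(image_paths)} image(s)  "


if __name__ == "__main__":
    demo = Task("DocVQA", "qa", "What is the invoice total?", ["p1.png", "p2.png"])
    print(answer(demo, echo_text_llm, counting_lvlm))
```

Because neither function depends on a particular model, swapping the LVLM or the prompt-writing model only means passing a different callable, which is one way to read the training-free, modular design.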