Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the performance gap between multi-agent and single-agent frameworks in diagram-grounded geometric reasoning. Method: We design a structured, multi-agent collaborative reasoning architecture, incorporating explicit geometric parsing modules, built on Qwen-2.5-VL (7B/32B) and Gemini-2.0-Flash, and conduct a unified evaluation across four visual mathematical benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. Contribution/Results: We present the first empirical evidence that multi-agent orchestration consistently improves open-weight MLLMs (e.g., +6.8/+3.3 points on Geometry3K for Qwen-2.5-VL 7B/32B) and can aid the zero-shot generalization of proprietary models on newer benchmarks, though task decomposition is not universally optimal: the single-agent setting remains stronger for the proprietary model on classic benchmarks. All code, data, and inference logs are publicly released, establishing a reproducible benchmark for framework design in visual mathematical reasoning.
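The interpreter/solver split described above can be sketched as a minimal two-stage pipeline. The function names and stub logic below are hypothetical stand-ins for the actual MLLM calls (Qwen-2.5-VL or Gemini-2.0-Flash); the real prompts and parsing modules are in the linked repository.

```python
# Minimal sketch of a two-agent pipeline: an "interpreter" agent first parses
# the diagram into a structured textual scene description, then a "solver"
# agent reasons over that description. Both agent functions are hypothetical
# stubs standing in for multimodal LLM calls.

def interpreter_agent(diagram: str, question: str) -> str:
    """Parse the diagram into an explicit scene description (stub)."""
    return f"scene: parsed elements of {diagram} relevant to '{question}'"

def solver_agent(scene: str, question: str) -> str:
    """Solve the problem from the parsed scene (stub)."""
    return f"answer derived from [{scene}] for '{question}'"

def multi_agent_solve(diagram: str, question: str) -> str:
    # Agentic decomposition: parsing and solving are separate model calls.
    scene = interpreter_agent(diagram, question)
    return solver_agent(scene, question)

def single_agent_solve(diagram: str, question: str) -> str:
    # Single-agent baseline: one model call handles both jointly.
    return f"direct answer for {diagram} / '{question}'"
```

The paper's comparison amounts to running both `multi_agent_solve` and `single_agent_solve` over each benchmark and scoring the answers; only the decomposition differs, not the underlying model.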

📝 Abstract
Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-agent vs single-agent frameworks for geometry problem solving
Comparing performance on visual math benchmarks using MLLMs
Assessing benefits of agentic decomposition across different model types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent pipelines consistently improve open-source model performance
Agentic decomposition shown to be not universally optimal across model types
Multi-agent collaboration assists proprietary systems on newer benchmarks