PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) often reach wrong conclusions on structured images (e.g., charts, geometric diagrams) because subtle perceptual errors cascade through the reasoning chain; existing cue-based approaches are constrained by low-fidelity image processing and rigid, linear reasoning patterns. To address this, we propose PixelCraft, a multi-agent framework with three core contributions: (1) visual tool agents that combine the pixel-level localizations of a fine-tuned grounding MLLM with classical computer vision algorithms for high-fidelity image processing; (2) a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism; and (3) an image memory that lets the planner revisit earlier visual steps and explore alternative reasoning branches rather than reasoning strictly sequentially. On challenging chart and geometry benchmarks, PixelCraft significantly improves the visual reasoning performance of advanced MLLMs, setting a new standard for structured image reasoning.

📝 Abstract
Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and rigid, linear reasoning patterns, which limits their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in the tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory that allows the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
Problem

Research questions and friction points this paper is trying to address.

Addresses visual reasoning challenges on structured images such as charts and geometric diagrams
Overcomes the low-fidelity image processing and rigid, linear reasoning patterns of existing cue-based methods
Mitigates cascading errors caused by perceptual slips in multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system for high-fidelity structured image processing
Pixel-level localization integrated with computer vision algorithms
Dynamic workflow with image memory for adaptive reasoning
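
The image memory idea can be illustrated with a small sketch. This is not the authors' implementation: the class names, fields, and tool names below are hypothetical, and the only assumption taken from the abstract is that intermediate images are kept in a structure the planner can revisit and branch from, rather than in an append-only history.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ImageNode:
    """One intermediate image and the tool that produced it (hypothetical schema)."""
    node_id: int
    image_ref: str               # path or handle of the stored image
    tool: str                    # name of the tool that produced this image
    parent: Optional[int]        # None for the original input image
    children: List[int] = field(default_factory=list)


class ImageMemory:
    """Tree-shaped store of intermediate images instead of an append-only list."""

    def __init__(self, root_image_ref: str) -> None:
        self.nodes: Dict[int, ImageNode] = {}
        self.root = self._new_node(root_image_ref, tool="input", parent=None)

    def _new_node(self, image_ref: str, tool: str, parent: Optional[int]) -> int:
        node_id = len(self.nodes)
        self.nodes[node_id] = ImageNode(node_id, image_ref, tool, parent)
        if parent is not None:
            self.nodes[parent].children.append(node_id)
        return node_id

    def add(self, parent: int, image_ref: str, tool: str) -> int:
        """Attach a new image under any earlier node, which is what enables branching."""
        if parent not in self.nodes:
            raise KeyError(f"unknown parent node {parent}")
        return self._new_node(image_ref, tool, parent)

    def path_to(self, node_id: int) -> List[str]:
        """Return the tool sequence from the input image down to the given node."""
        path: List[str] = []
        current: Optional[int] = node_id
        while current is not None:
            node = self.nodes[current]
            path.append(node.tool)
            current = node.parent
        return path[::-1]


# Usage: build one branch, then go back to an earlier step and try another.
memory = ImageMemory("chart.png")
crop = memory.add(memory.root, "crop.png", tool="crop_subplot")
legend = memory.add(crop, "legend.png", tool="mask_legend")
zoom = memory.add(crop, "zoom.png", tool="zoom_axis")  # alternative branch from crop
print(memory.path_to(zoom))
```

Because every node records its parent, a planner could return to any earlier image and start a new branch without discarding the branch it abandoned, which an append-only image history does not allow.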
Authors
Shuoshuo Zhang
Microsoft Research
Zijian Li
Hong Kong University of Science and Technology
Yizhen Zhang
Tsinghua University
Jingjing Fu
MS
image/video processing
Lei Song
Microsoft Research
Jiang Bian
Microsoft Research
Jun Zhang
Hong Kong University of Science and Technology
Yujiu Yang
SIGS, Tsinghua University
Machine Learning, Natural language processing, Computer vision
Rui Wang
Microsoft Research