GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single vision-language models (VLMs) exhibit limitations in perceptual subtask handling, logical consistency, and factual correctness for complex visual-language reasoning. Method: We propose a multi-agent collaborative reasoning framework that formalizes inference as a non-zero-sum game between perception-specialized agents and a logic-verification agent. An uncertainty-aware controller dynamically triggers iterative multi-round debates, while a game-theoretic, interpretable collaboration mechanism integrates structured reasoning chains, multi-agent communication protocols, and dynamic scheduling strategies. Contribution/Results: Our approach is the first to deeply couple uncertainty quantification with game-theoretic modeling, achieving robustness, traceability, and interpretability. On four major benchmarks (MMMU, MMBench, MVBench, and V*Bench) it improves accuracy by 5–6% for small-to-medium VLMs (e.g., Qwen2.5-VL-7B) and by up to 2–3% even for strong models (e.g., GPT-4o), demonstrating broad applicability and effectiveness.
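The page does not include reference code for the controller described above, so the following is a minimal Python sketch of how an uncertainty-aware debate loop could look. The agent interfaces (`propose`, `verify`, `aggregate`), the `Claim` fields, and the threshold `tau` are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from collections import Counter
import math

@dataclass
class Claim:
    answer: str          # the agent's proposed answer
    evidence: str        # supporting rationale
    uncertainty: float   # self-reported, in [0, 1]; 1 = maximally uncertain

def disagreement(claims):
    """Normalized entropy over proposed answers: 0 = consensus, 1 = full split."""
    counts = Counter(c.answer for c in claims)
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(claims)) if len(claims) > 1 else 1.0
    return entropy / max_entropy

def debate(base_agents, critic, query, max_rounds=3, tau=0.4):
    """Run extra debate rounds while disagreement or mean uncertainty stays high."""
    history = []
    claims = []
    for _ in range(max_rounds):
        claims = [agent.propose(query, history) for agent in base_agents]
        mean_unc = sum(c.uncertainty for c in claims) / len(claims)
        if disagreement(claims) < tau and mean_unc < tau:
            break  # low disagreement and low uncertainty: stop debating
        history.append(critic.verify(claims))  # critic feedback seeds the next round
    return critic.aggregate(claims, history)
```

The loop only pays for additional rounds when the stopping test fails, which is the efficiency argument implicit in "dynamically triggers iterative multi-round debates."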

📝 Abstract
We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents, each specializing in visual perception subtasks, and a critical agent that verifies logical consistency and factual correctness. Agents communicate via structured claims, evidence, and uncertainty estimates. The framework introduces an uncertainty-aware controller that dynamically adjusts agent collaboration, triggering multi-round debates when disagreement or ambiguity is detected. This process yields more robust and interpretable predictions. Experiments on four challenging benchmarks (MMMU, MMBench, MVBench, and V*Bench) demonstrate that GAM-Agent significantly improves performance across various VLM backbones. Notably, GAM-Agent boosts the accuracy of small-to-mid-scale models (e.g., Qwen2.5-VL-7B, InternVL3-14B) by 5–6%, and still enhances strong models like GPT-4o by up to 2–3%. Our approach is modular, scalable, and generalizable, offering a path toward reliable and explainable multi-agent multimodal reasoning.
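The abstract's "structured claims, evidence, and uncertainty estimates" imply a typed message format between agents. As one hypothetical serialization, not the paper's actual protocol, messages could be carried as JSON and combined by uncertainty-weighted voting:

```python
import json

# Hypothetical wire format; the field names are illustrative, not from the paper.
message = {
    "agent_id": "perception-ocr",
    "claim": "The sign reads 'No Parking'",
    "evidence": ["high-contrast text region, top-left", "OCR confidence 0.93"],
    "uncertainty": 0.12,   # 0 = certain, 1 = maximally uncertain
    "round": 1,
}
payload = json.dumps(message)  # what would travel between agents

def weighted_vote(messages):
    """Pick the claim with the highest total certainty mass across agents."""
    scores = {}
    for m in messages:
        scores[m["claim"]] = scores.get(m["claim"], 0.0) + (1.0 - m["uncertainty"])
    return max(scores, key=scores.get)
```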
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision-language reasoning via game-theoretic multi-agent collaboration
Improving robustness and interpretability through uncertainty-aware dynamic adjustments
Boosting performance of VLMs across diverse benchmarks and model scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Game-theoretic multi-agent framework for reasoning (a toy payoff sketch follows this list)
Uncertainty-aware controller adjusts agent collaboration
Modular, scalable approach enhances model performance
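The game-theoretic framing can be grounded with a toy non-zero-sum payoff table: both the proposing agent and the critic can gain from a caught error, so verification is cooperative rather than purely adversarial. The payoff values below are illustrative only and do not come from the paper.

```python
# Toy payoff table (illustrative numbers, not from the paper).
# Keys: (base agent's move, critic's move); values: (base payoff, critic payoff).
# Because some cells reward both players at once, the game is non-zero-sum.
PAYOFFS = {
    ("assert_correct", "accept"):    (1.0, 1.0),    # verified right answer: both win
    ("assert_correct", "challenge"): (0.5, 0.2),    # needless debate costs a little
    ("assert_wrong",   "accept"):    (-1.0, -1.0),  # error slips through: both lose
    ("assert_wrong",   "challenge"): (0.2, 1.0),    # critic catches the error
}

def critic_best_response(p_correct):
    """Critic's expected-payoff-maximizing move, given its belief the claim is right."""
    expected = {
        move: p_correct * PAYOFFS[("assert_correct", move)][1]
              + (1 - p_correct) * PAYOFFS[("assert_wrong", move)][1]
        for move in ("accept", "challenge")
    }
    return max(expected, key=expected.get)
```

With these numbers the critic challenges whenever its belief that the claim is correct drops below about 0.71 (where 2p - 1 = 1 - 0.8p), mirroring the uncertainty-triggered debates described in the summary.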
👥 Authors
Jusheng Zhang
Sun Yat-sen University
Yijia Fan
Sun Yat-sen University
Wenjun Lin
Sun Yat-sen University
Ruiqi Chen
Vrije Universiteit Brussel
FPGAs · Domain-specific Accelerators
Haoyi Jiang
Huazhong University of Science and Technology
Computer Vision · Autonomous Driving
Wenhao Chai
Princeton University
Machine Learning · Computer Vision
Jian Wang
Snap Inc.
Keze Wang
Sun Yat-sen University