Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current GUI-based multimodal agents exhibit insufficient reliability on complex or cross-domain tasks, often relying on spurious correlations rather than genuine reasoning. To address this, we propose Agent-ScanKit—the first non-intrusive probing framework tailored for GUI scenarios. It employs three orthogonal sensitivity perturbations—visual, textual, and structural—to disentangle and quantify the respective contributions of memory retrieval and system-level reasoning in agents, without requiring internal model access. Extensive experiments across five public GUI benchmarks and eighteen state-of-the-art agents reveal that the vast majority predominantly rely on memorized training-data alignments, exhibiting severe deficits in compositional and generalizable reasoning. This work provides the first empirical evidence characterizing modern multimodal agents as “memory-dominated and reasoning-deficient.” By offering a principled diagnostic tool and actionable insights, Agent-ScanKit establishes a critical foundation for developing next-generation agents with robust, generalizable reasoning capabilities.

📝 Abstract
Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
Problem

Research questions and friction points this paper is trying to address.

Probing multimodal agents' memory and reasoning capabilities
Quantifying memorization versus systematic reasoning contributions
Assessing generalization limitations in GUI-based multimodal agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probing framework for multimodal agent analysis
Three orthogonal sensitivity perturbation paradigms
Quantifying memorization versus reasoning contributions
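The probing idea above can be sketched in code: apply semantics-preserving perturbations along each of the three axes and check whether the agent's decision stays stable. A reasoning-driven agent should be invariant to such changes; a flipped decision suggests reliance on memorized surface alignments. The sketch below is purely illustrative: the agent, the perturbation functions, and the scoring are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of perturbation-based probing (hypothetical; not the
# paper's implementation). The agent and perturbations are stand-ins.
from typing import Callable, Dict, Tuple

def dummy_agent(observation: str, instruction: str) -> str:
    # Stand-in GUI agent: maps (screen, instruction) to an action string.
    return f"click:{hash((observation, instruction)) % 10}"

def probe(agent: Callable[[str, str], str],
          obs: str, instr: str,
          perturbations: Dict[str, Callable[[str, str], Tuple[str, str]]]
          ) -> Dict[str, float]:
    """Score decision stability under semantics-preserving perturbations."""
    baseline = agent(obs, instr)
    scores = {}
    for name, perturb in perturbations.items():
        p_obs, p_instr = perturb(obs, instr)
        # 1.0 = decision unchanged (consistent with reasoning);
        # 0.0 = decision flipped (suggests memorized alignment).
        scores[name] = 1.0 if agent(p_obs, p_instr) == baseline else 0.0
    return scores

# Three orthogonal perturbation axes, mirroring the paper's taxonomy.
perturbations = {
    "visual": lambda o, i: (o.replace("icon_a icon_b", "icon_b icon_a"), i),
    "textual": lambda o, i: (o, i.replace("open", "launch")),
    "structural": lambda o, i: (o.replace("<menu><item>", "<item><menu>"), i),
}

scores = probe(dummy_agent,
               "<menu><item>icon_a icon_b</item></menu>",
               "open settings",
               perturbations)
print(scores)  # per-axis consistency scores
```

Aggregating these consistency scores over a benchmark would yield the kind of memorization-versus-reasoning quantification the framework targets, without any access to model internals.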