Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current GUI-based multimodal agents exhibit insufficient reliability on complex or cross-domain tasks, often relying on spurious correlations rather than genuine reasoning. To address this, we propose Agent-ScanKit—the first non-intrusive probing framework tailored for GUI scenarios. It employs three orthogonal sensitivity perturbations—visual, textual, and structural—to disentangle and quantify the respective contributions of memory retrieval and system-level reasoning in agents, without requiring internal model access. Extensive experiments across five public GUI benchmarks and eighteen state-of-the-art agents reveal that the vast majority predominantly rely on memorized training-data alignments, exhibiting severe deficits in compositional and generalizable reasoning. This work provides the first empirical evidence characterizing modern multimodal agents as “memory-dominated and reasoning-deficient.” By offering a principled diagnostic tool and actionable insights, Agent-ScanKit establishes a critical foundation for developing next-generation agents with robust, generalizable reasoning capabilities.

📝 Abstract
Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
Problem

Research questions and friction points this paper is trying to address.

Probing multimodal agents' memory and reasoning capabilities
Quantifying memorization versus systematic reasoning contributions
Assessing generalization limitations in GUI-based multimodal agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probing framework for multimodal agent analysis
Three orthogonal sensitivity perturbation paradigms
Quantifying memorization versus reasoning contributions
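The probing idea above can be sketched in code: apply semantics-preserving perturbations along each of the three axes and check whether the agent's decision stays stable. A reasoning-driven agent should be invariant to such changes; a flipped decision suggests reliance on memorized surface alignments. The sketch below is purely illustrative: the agent, the perturbation functions, and the scoring are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of perturbation-based probing (hypothetical; not the
# paper's implementation). The agent and perturbations are stand-ins.
from typing import Callable, Dict, Tuple

def dummy_agent(observation: str, instruction: str) -> str:
    # Stand-in GUI agent: maps (screen, instruction) to an action string.
    return f"click:{hash((observation, instruction)) % 10}"

def probe(agent: Callable[[str, str], str],
          obs: str, instr: str,
          perturbations: Dict[str, Callable[[str, str], Tuple[str, str]]]
          ) -> Dict[str, float]:
    """Score decision stability under semantics-preserving perturbations."""
    baseline = agent(obs, instr)
    scores = {}
    for name, perturb in perturbations.items():
        p_obs, p_instr = perturb(obs, instr)
        # 1.0 = decision unchanged (consistent with reasoning);
        # 0.0 = decision flipped (suggests memorized alignment).
        scores[name] = 1.0 if agent(p_obs, p_instr) == baseline else 0.0
    return scores

# Three orthogonal perturbation axes, mirroring the paper's taxonomy.
perturbations = {
    "visual": lambda o, i: (o.replace("icon_a icon_b", "icon_b icon_a"), i),
    "textual": lambda o, i: (o, i.replace("open", "launch")),
    "structural": lambda o, i: (o.replace("<menu><item>", "<item><menu>"), i),
}

scores = probe(dummy_agent,
               "<menu><item>icon_a icon_b</item></menu>",
               "open settings",
               perturbations)
print(scores)  # per-axis consistency scores
```

Aggregating these consistency scores over a benchmark would yield the kind of memorization-versus-reasoning quantification the framework targets, without any access to model internals.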