Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether the visual modality can substitute for the textual modality in the working memory of multimodal large language models, focusing on performance differences in spatial n-back tasks. Using controlled grid-based n-back tasks rendered as either text or images, the authors systematically compare Qwen2.5 and Qwen2.5-VL across modalities via trial-wise log-probability analyses, d' sensitivity measures, and manipulations of grid size. Results show that both models perform significantly better under textual conditions than under visual ones, and that both rely predominantly on recency-based repetition strategies rather than on the instruction-specified lag. Grid size also substantially shapes interference patterns and error types. This work provides the first evidence of modality-dependent computational biases in the working-memory mechanisms of multimodal large language models.
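For context, the d' sensitivity measure mentioned above is the standard signal-detection index. A minimal formulation, writing H for the hit rate on match trials and F for the false-alarm rate on non-match trials (notation assumed here, not drawn from the paper):

```latex
d' = \Phi^{-1}(H) - \Phi^{-1}(F)
```

where \Phi^{-1} is the inverse of the standard normal CDF. Higher d' means better discrimination of n-back matches from non-matches, independent of response bias.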

📝 Abstract
Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than a textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2-/3-back behavior often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters the recent-repeat structure of the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
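To make the recency-versus-lag distinction concrete, the sketch below (an illustration under assumed details, not the authors' released code) generates a spatial n-back stream and flags the diagnostic trials where a 2-back match and a 1-back repeat dissociate; on those trials a recency-locked model and a lag-following model predict opposite answers.

```python
import random

def make_stream(n_trials=30, grid=3, seed=0):
    """Hypothetical stimulus generator: one highlighted cell per trial
    on a grid x grid board, indexed 0..grid*grid-1."""
    rng = random.Random(seed)
    return [rng.randrange(grid * grid) for _ in range(n_trials)]

def label_trials(stream, lag=2):
    """Label each trial with (a) the instructed n-back match and
    (b) a 1-back recency repeat. Trials where the two labels disagree
    are diagnostic: a recency-locked strategy answers (b), not (a)."""
    labels = []
    for t, cell in enumerate(stream):
        nback_match = t >= lag and stream[t - lag] == cell
        recency_repeat = t >= 1 and stream[t - 1] == cell
        labels.append((t, cell, nback_match, recency_repeat))
    return labels

if __name__ == "__main__":
    stream = make_stream()
    diagnostic = [row for row in label_trials(stream) if row[2] != row[3]]
    print(f"{len(diagnostic)} diagnostic trials out of {len(stream)}")
```

On such trials, a trial-wise log-probability analysis of the kind described in the abstract would compare the evidence the model assigns to a "match" response under each hypothesis; note also that a smaller grid mechanically raises the rate of recency repeats, which is one way grid size can reshape interference.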
Problem

Research questions and friction points this paper is trying to address.

working memory
vision-language models
n-back task
visual encoding
textual encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
working memory
n-back task
multimodal cognition
recency effect