🤖 AI Summary
This work identifies a critical perceptual deficiency in current multimodal agents—exemplified by OpenAI's Computer-Using Agent (CUA)—when operating in real-world GUI environments. We conduct a systematic, end-to-end evaluation of CUA on *The New York Times* Wordle game, where the agent receives raw screen pixels as input and produces keyboard and mouse actions as output. The agent achieves a mere 5.36% task success rate. The root cause is a severe context-sensitive color misperception: the model fails to reliably distinguish semantically critical colors (the green, yellow, and gray feedback tiles), producing a fundamental visual-semantic misalignment that breaks the downstream decision chain. To our knowledge, this is the first empirical demonstration, in a standard human-computer interface, of task-level fragility in multimodal agents arising from a failure of color perception to generalize. Our findings challenge optimistic assumptions about AGI progress and establish a new benchmark and diagnostic paradigm for evaluating multimodal robustness.
📝 Abstract
This paper investigates multimodal agents, in particular OpenAI's Computer-Using Agent (CUA), trained to control a standard computer interface and complete tasks through it, much as humans do. We evaluated the agent's performance on *The New York Times* Wordle game to elicit model behaviors and identify shortcomings. Our findings revealed a significant discrepancy in the model's ability to recognize colors correctly depending on the context. The model achieved a $5.36\%$ success rate over several hundred runs spanning a week of daily Wordle puzzles. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce the fact that even simple tasks present substantial challenges for today's frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.
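To make the perceptual demand concrete, the feedback colors the agent must distinguish can be classified deterministically from raw pixels in a few lines. The sketch below is illustrative only, not part of the evaluated system; the reference RGB values approximate the NYT Wordle light-mode palette and the nearest-color rule is an assumption:

```python
# Minimal sketch of the color-classification step a Wordle-playing agent
# must get right. Reference RGB values approximate the NYT Wordle
# light-mode palette and are illustrative, not taken from the paper.
REFERENCE = {
    "green":  (0x6A, 0xAA, 0x64),  # letter in word, correct position
    "yellow": (0xC9, 0xB4, 0x58),  # letter in word, wrong position
    "gray":   (0x78, 0x7C, 0x7E),  # letter not in word
}

def classify_tile(rgb):
    """Label a sampled tile pixel by its nearest reference color
    (squared Euclidean distance in RGB space)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(REFERENCE, key=lambda name: dist(rgb, REFERENCE[name]))
```

That a task solvable by such a trivial pixel rule defeats a frontier multimodal model is precisely the visual-semantic gap the evaluation exposes.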