🤖 AI Summary
This work identifies a critical perceptual deficiency in current multimodal agents—exemplified by OpenAI's Computer-Using Agent (CUA)—when operating in real-world GUI environments. We conduct a systematic, end-to-end evaluation of CUA on *The New York Times* Wordle game, where the agent receives raw screen pixels as input and produces keyboard and mouse actions as output. The agent achieves a mere 5.36% task success rate. The root cause is a severe context-sensitive color misperception: the model fails to reliably distinguish semantically critical colors (the green, yellow, and gray feedback tiles), producing a fundamental visual-semantic misalignment that breaks the downstream decision chain. To our knowledge, this is the first empirical demonstration, in a standard human-computer interface, of task-level fragility in multimodal agents arising from a failure of color perception to generalize. Our findings challenge optimistic assumptions about AGI progress and establish a new benchmark and diagnostic paradigm for evaluating multimodal robustness.
📝 Abstract
This paper investigates multimodal agents, in particular OpenAI's Computer-Using Agent (CUA), trained to control a standard computer interface and complete tasks through it, much as humans do. We evaluated the agent's performance on *The New York Times* Wordle game to elicit model behaviors and identify shortcomings. Our findings revealed a significant discrepancy in the model's ability to recognize colors correctly depending on the context. The model achieved a $5.36\%$ success rate over several hundred runs spanning a week of daily Wordle puzzles. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce the fact that even simple tasks present substantial challenges for today's frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.
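To make the perceptual demand concrete, the feedback colors the agent must distinguish can be classified deterministically from raw pixels in a few lines. The sketch below is illustrative only, not part of the evaluated system; the reference RGB values approximate the NYT Wordle light-mode palette and the nearest-color rule is an assumption:

```python
# Minimal sketch of the color-classification step a Wordle-playing agent
# must get right. Reference RGB values approximate the NYT Wordle
# light-mode palette and are illustrative, not taken from the paper.
REFERENCE = {
    "green":  (0x6A, 0xAA, 0x64),  # letter in word, correct position
    "yellow": (0xC9, 0xB4, 0x58),  # letter in word, wrong position
    "gray":   (0x78, 0x7C, 0x7E),  # letter not in word
}

def classify_tile(rgb):
    """Label a sampled tile pixel by its nearest reference color
    (squared Euclidean distance in RGB space)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(REFERENCE, key=lambda name: dist(rgb, REFERENCE[name]))
```

That a task solvable by such a trivial pixel rule defeats a frontier multimodal model is precisely the visual-semantic gap the evaluation exposes.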