🤖 AI Summary
Existing multimodal large language model (MLLM) agents lack systematic evaluation of multi-step visual reasoning and interactive action capabilities in realistic web interaction scenarios—particularly CAPTCHA solving. Method: We introduce the first CAPTCHA benchmark platform tailored for real-world web interaction, comprising 225 modern CAPTCHAs across 20 categories, and propose “CAPTCHA Reasoning Depth” as a novel metric to quantify reasoning complexity. We establish a web-native, interactive, and extensible evaluation paradigm for multimodal agents, integrating browser automation with MLLMs (e.g., Browser-Use with OpenAI o3) to support end-to-end visual perception, action planning, and real-time interaction. Contribution/Results: Human participants achieve a 93.3% success rate, while state-of-the-art MLLM agents attain only 40.0%, exposing fundamental limitations in dynamic, interactive reasoning. This benchmark provides critical infrastructure and concrete directions for advancing embodied intelligent agents.
📝 Abstract
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs remains largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near-perfect scores, while state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (Browser-Use with OpenAI o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and data are available at this https URL.