Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing multimodal large language model (MLLM) agents lack systematic evaluation of multi-step visual reasoning and interactive action capabilities in realistic web-interaction scenarios, particularly CAPTCHA solving. Method: We introduce the first CAPTCHA benchmark platform tailored for real-world web interaction, comprising 225 modern CAPTCHAs across 20 categories, and propose "CAPTCHA Reasoning Depth" as a novel metric to quantify reasoning complexity. We establish a web-native, interactive, and extensible evaluation paradigm for multimodal agents, integrating browser automation with MLLMs (e.g., Browser-Use with OpenAI-o3) to support end-to-end visual perception, action planning, and real-time interaction. Contribution/Results: Human participants achieve 93.3% success, while state-of-the-art MLLM agents attain only 40.0%, exposing fundamental limitations in dynamic, interactive reasoning. The benchmark provides critical infrastructure and concrete directions for advancing embodied intelligent agents.

📝 Abstract
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose, CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (Browser-Use with OpenAI-o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
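The abstract describes agents that must perceive a page, plan an action, and interact in real time. A minimal sketch of that perception–plan–act loop is below; all names (`solve_captcha`, `ToyEnv`, `ToyAgent`, `observe`, `plan`, `act`) are illustrative stand-ins, not the paper's or Browser-Use's actual API.

```python
# Illustrative sketch of the observe -> plan -> act loop an MLLM-powered
# web agent runs when attempting a CAPTCHA. Names are hypothetical.

def solve_captcha(env, agent, max_steps: int = 10) -> bool:
    """Iteratively observe the page, plan one action, and execute it."""
    for _ in range(max_steps):
        screenshot = env.observe()       # visual perception of page state
        action = agent.plan(screenshot)  # action planning by the model
        done = env.act(action)           # interaction with the browser
        if done:
            return env.is_solved()
    return False                         # step budget exhausted


class ToyEnv:
    """Toy stand-in for a browser: solved after 'click' then 'verify'."""
    def __init__(self):
        self.clicked = False
        self.verified = False

    def observe(self):
        return "clicked" if self.clicked else "fresh"

    def act(self, action):
        if action == "click":
            self.clicked = True
        elif action == "verify":
            self.verified = True
        return self.verified

    def is_solved(self):
        return self.clicked and self.verified


class ToyAgent:
    """Scripted policy standing in for an MLLM planner."""
    def plan(self, obs):
        return "click" if obs == "fresh" else "verify"


print(solve_captcha(ToyEnv(), ToyAgent()))  # True
```

The multi-step structure is the point: unlike a static perception task, success requires each action to condition on the page state produced by the previous one.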
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLM agents' ability to solve diverse CAPTCHAs
Measuring CAPTCHA Reasoning Depth for cognitive steps
Benchmarking human vs MLLM performance on CAPTCHAs
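The CAPTCHA Reasoning Depth metric counts the cognitive and motor steps needed to solve a puzzle. A minimal sketch of how such an annotation-based count could be computed is below; the `CaptchaAnnotation` class and step lists are hypothetical, not the paper's actual annotation schema.

```python
# Hypothetical sketch: reasoning depth as the number of annotated
# cognitive + motor steps for one CAPTCHA instance.
from dataclasses import dataclass


@dataclass
class CaptchaAnnotation:
    """Illustrative per-CAPTCHA annotation (not the paper's schema)."""
    captcha_type: str
    cognitive_steps: list  # e.g. "scan grid", "match target class"
    motor_steps: list      # e.g. "click matching tiles", "click verify"


def reasoning_depth(ann: CaptchaAnnotation) -> int:
    """Depth = total cognitive and motor steps required to solve."""
    return len(ann.cognitive_steps) + len(ann.motor_steps)


ann = CaptchaAnnotation(
    captcha_type="image-grid",
    cognitive_steps=["parse instruction", "scan grid", "match target class"],
    motor_steps=["click matching tiles", "click verify"],
)
print(reasoning_depth(ann))  # 5
```

A scalar like this makes it possible to bin the 225 CAPTCHAs by difficulty and correlate agent success rate with required depth.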
Innovation

Methods, ideas, or system contributions that make the work stand out.

Web-based platform for multimodal LLM testing
Diverse CAPTCHAs benchmark with new metric
Evaluates visual reasoning and interaction capabilities