🤖 AI Summary
This work addresses a barrier to end-to-end automation by intelligent agents on real-world websites: existing CAPTCHA verification mechanisms demand complex multi-step visual reasoning and interaction, and the problem is exacerbated by limited training data and the absence of process-level annotations. To overcome these limitations, the authors introduce CaptchaBench, the first large-scale CAPTCHA benchmark with fine-grained region annotations and explicit reasoning-process annotations. They also present CaptchaMind, a reinforcement-learning-based solver that uses explicit reasoning supervision to better handle intricate visual details and region-comparison tasks. In experiments, the approach achieves an average success rate of 82.9% across eight task categories and 71.0% on real-world CAPTCHA instances, substantially outperforming all existing methods that do not rely on proprietary APIs.
📝 Abstract
CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region-level and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning-process supervision, which achieves an average success rate of 82.9% across the eight task categories and 71.0% on real-world instances, substantially outperforming all existing methods that do not rely on closed-source APIs.
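The headline metric above is an average success rate over task categories. The abstract does not say how that average is computed, so the following is only a minimal sketch that assumes an unweighted (macro) mean of per-category success rates. The `EpisodeResult` record, the function names, and the category labels are hypothetical placeholders, not the benchmark's actual schema or task names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one solve attempt (hypothetical record, not the benchmark's schema)."""
    category: str  # task category label; the names used below are placeholders
    solved: bool   # True if the solver passed the CAPTCHA


def per_category_success(results):
    """Return {category: fraction of attempts solved} for the given results."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.category] += 1
        solved[r.category] += int(r.solved)
    return {c: solved[c] / total[c] for c in total}


def macro_average(rates):
    """Unweighted mean over categories (an assumed averaging scheme)."""
    return sum(rates.values()) / len(rates)


# Toy data for illustration only; these are not results from the paper.
results = [
    EpisodeResult("category_a", True),
    EpisodeResult("category_a", False),
    EpisodeResult("category_b", True),
    EpisodeResult("category_b", True),
]
rates = per_category_success(results)
avg = macro_average(rates)
```

Under this assumption every category counts equally regardless of how many samples it contains. A sample-weighted (micro) average would instead divide total solved attempts by total attempts, and the two can differ when categories are unevenly sized.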