🤖 AI Summary
This work addresses a key limitation of random GUI testing: exploration often becomes trapped in UI exploration tarpits, leading to insufficient coverage and missed defects. To overcome this, the authors introduce large language models (LLMs) into random testing for the first time, proposing two hybrid tools, HybridMonkey and HybridDroidbot, that monitor UI similarity to detect tarpits and leverage LLMs to recommend effective events for escaping the trapped local regions, thereby improving exploration efficiency. Evaluated on 12 real-world applications, the approaches achieve average activity coverage improvements of 54.8% and 44.8%, respectively, and uncover 75 unique bugs, including 34 previously unknown ones; 26 of the bugs have been confirmed and fixed. Experiments on WeChat further demonstrate a significant advantage over conventional random testing.
📝 Abstract
Random GUI testing is a widely used technique for testing mobile apps. However, its effectiveness is limited by a notorious issue: UI exploration tarpits, where exploration becomes trapped in local UI regions, impeding test coverage and bug discovery. In this experience paper, we introduce LLM-powered random GUI testing, a novel hybrid testing approach that mitigates UI tarpits during random testing. Our approach monitors UI similarity to identify tarpits and queries LLMs to suggest promising events for escaping the encountered tarpits. We implement our approach on top of two different automated input generation (AIG) tools for mobile apps: (1) HybridMonkey upon Monkey, a state-of-the-practice tool; and (2) HybridDroidbot upon Droidbot, a state-of-the-art tool. We evaluate them on 12 popular, real-world apps. The results show that HybridMonkey and HybridDroidbot outperform all baselines, achieving average coverage improvements of 54.8% and 44.8%, respectively, and detecting the highest number of unique crashes. In total, we found 75 unique bugs, including 34 previously unknown bugs; to date, 26 have been confirmed and fixed. We also applied HybridMonkey to WeChat, a popular industrial app with billions of monthly active users, where it achieved higher activity coverage and found more bugs than random testing.
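The core loop the abstract describes, watching UI similarity and flagging a tarpit when recent screens stay too similar, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Jaccard similarity over widget sets, the sliding-window size, and the 0.8 threshold are all assumptions chosen for the example (the `TarpitDetector` class and its parameters are hypothetical names, not from the paper).

```python
from collections import deque

def jaccard(a, b):
    """Jaccard similarity between two sets of UI widget identifiers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class TarpitDetector:
    """Flags a UI exploration tarpit when the last `window` screens all
    look highly similar to the current one (similarity >= threshold).
    Both parameter values are illustrative, not from the paper."""

    def __init__(self, window=5, threshold=0.8):
        self.recent = deque(maxlen=window)   # recently observed screens
        self.threshold = threshold

    def observe(self, widgets):
        """widgets: set of widget ids on the current screen.
        Returns True if exploration appears trapped in a local UI region."""
        trapped = (
            len(self.recent) == self.recent.maxlen
            and all(jaccard(widgets, prev) >= self.threshold
                    for prev in self.recent)
        )
        self.recent.append(frozenset(widgets))
        return trapped
```

In the hybrid testing loop, events would be generated randomly until `observe` returns `True`; at that point the tool would serialize the current UI state into a prompt and ask the LLM to suggest an event likely to leave the local region, as the approach in the abstract does.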