🤖 AI Summary
Existing large language model (LLM) evaluation benchmarks lack coverage of mobile application development scenarios, failing to capture platform-specific constraints, framework lifecycles, and complex API interactions inherent to such environments. To address this gap, this work introduces the first multilingual repair benchmark targeting three major mobile platforms—Android Native, React Native, and Flutter—comprising 384 real-world production issues. Each task is accompanied by an executable test patch enabling automated validation of cross-file and cross-artifact modifications. Experimental results reveal that state-of-the-art code LLMs achieve end-to-end repair success rates of only 3.39%–5.21% on this benchmark, substantially lower than their performance on existing datasets, thereby exposing critical limitations in multi-file fault localization and coordinated repair capabilities.
📝 Abstract
Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications, leaving mobile application development largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark comprising 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluating four state-of-the-art code-capable LLMs (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder) yields low end-to-end resolution rates of 3.39%–5.21%, revealing significant performance gaps compared to prior benchmarks. Further analysis uncovers systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.
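The "fully automated validation" the abstract describes follows the common fail-to-pass pattern: the benchmark's test patch introduces tests that fail on the buggy code, and a model-generated fix counts as resolved only if those tests pass afterward without regressing previously passing tests. A minimal sketch of that decision logic (my own illustration, not the authors' harness; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TestRun:
    """Outcome of one test-suite execution: which test IDs passed/failed."""
    passed: set[str]
    failed: set[str]

def is_resolved(before: TestRun, after: TestRun, fail_to_pass: set[str]) -> bool:
    """A fix is 'resolved' only if every test the gold test patch exposed
    (fail_to_pass) now passes, and no test that passed before has regressed."""
    newly_passing = fail_to_pass <= after.passed   # all target tests fixed
    no_regressions = before.passed <= after.passed  # nothing broke
    return newly_passing and no_regressions

def resolution_rate(outcomes: list[bool]) -> float:
    """End-to-end resolution rate as a percentage over all instances."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Example: two tests exposed by the test patch, one prior passing test.
before = TestRun(passed={"t_old"}, failed={"t_bug1", "t_bug2"})
good_fix = TestRun(passed={"t_old", "t_bug1", "t_bug2"}, failed=set())
bad_fix = TestRun(passed={"t_bug1", "t_bug2"}, failed={"t_old"})  # regresses t_old
```

Under this criterion, a rate like 3.39% corresponds to roughly 13 of the 384 instances being resolved end-to-end.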