🤖 AI Summary
Existing mobile agents struggle to decouple and jointly enhance mixed capabilities—such as screen understanding, subtask planning, and action execution—leading to constrained reasoning performance. This work proposes the Channel-of-Mobile-Experts (CoME) architecture, which employs an output-oriented activation mechanism to dynamically invoke specialized expert modules. To mitigate error propagation and enhance the informational value of intermediate reasoning steps, CoME incorporates a three-stage progressive fine-tuning strategy (Expert-FT, Router-FT, CoT-FT) and an information gain–driven Direct Preference Optimization (DPO) method. Evaluated on the AITZ and AMEX benchmarks, CoME substantially outperforms dense models and existing Mixture-of-Experts approaches, demonstrating its superiority in hybrid-capability reasoning tasks.
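The information gain–driven DPO idea can be illustrated with a small sketch. This is a hypothetical toy example, not the paper's implementation: it scores each intermediate reasoning step by how much it reduces the entropy of a (made-up) answer distribution, so that more informative steps would receive more weight in the preference objective.

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a discrete distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def info_gain(prior, posterior):
    # Gain of a reasoning step = entropy reduction over the candidate-answer
    # distribution after conditioning on that step.
    return entropy(prior) - entropy(posterior)

# Toy answer distributions before and after two intermediate steps
# (illustrative numbers, not from the paper).
steps = [
    ([0.25, 0.25, 0.25, 0.25], [0.70, 0.10, 0.10, 0.10]),  # informative step
    ([0.70, 0.10, 0.10, 0.10], [0.72, 0.10, 0.09, 0.09]),  # low-gain step
]

gains = [info_gain(prior, post) for prior, post in steps]
# Normalized weights: higher-gain steps count for more in the preference loss.
weights = [g / sum(gains) for g in gains]
```

Under this toy scoring, the first step earns most of the weight, matching the stated goal of guiding the model toward more informative intermediate reasoning.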
📝 Abstract
Mobile agents can autonomously execute user instructions, which requires hybrid-capability reasoning spanning screen summary, subtask planning, action decision, and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. Via output-oriented activation, CoME activates the corresponding expert to generate the output tokens of each reasoning stage. To empower CoME with hybrid-capability reasoning, we introduce a progressive training strategy: Expert-FT decouples and enhances each expert's capability; Router-FT aligns expert activation with the corresponding reasoning stage; CoT-FT facilitates seamless collaboration and balanced optimization across the capabilities. To mitigate error propagation in hybrid-capability reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both the AITZ and AMEX datasets.
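The output-oriented activation described above can be sketched as a simple routing loop. This is a minimal illustrative mock, not the paper's code: the four experts are stand-in callables, the router is a plain stage-to-expert lookup, and each stage's output is appended to the running reasoning trace for the next stage.

```python
# The four reasoning stages named in the abstract, in order.
STAGES = ["screen_summary", "subtask_planning", "action_decision", "action_function"]

def make_expert(stage):
    # Stand-in for a fine-tuned expert module (what Expert-FT would train).
    def expert(trace):
        return f"[{stage}] output given: {trace[-1]}"
    return expert

EXPERTS = {stage: make_expert(stage) for stage in STAGES}

def route(stage):
    # Stand-in for the learned router (what Router-FT would align):
    # activate the expert matching the current reasoning stage.
    return EXPERTS[stage]

def come_reason(instruction):
    # CoT-style pipeline: each stage's output conditions the next stage.
    trace = [instruction]
    for stage in STAGES:
        trace.append(route(stage)(trace))
    return trace[1:]

steps = come_reason("open the settings app")
```

In the actual architecture the router would operate over model states and the experts would be specialized parameter sets; the sketch only shows the stage-wise activation pattern.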