Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

📅 2026-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing unified multimodal models suffer from high computational costs and strong data dependencies, making them impractical for deployment on edge devices. This work proposes Mobile-O, the first lightweight vision-language-diffusion unified model tailored for mobile platforms. Its key innovations are a novel quadruplet-based post-training format, a Mobile Conditioning Projector module that leverages depthwise-separable convolutions and layer-wise alignment, and an efficient cross-modal conditioning mechanism. Experimental results show that Mobile-O achieves 74% on GenEval and outperforms Show-O and JanusFlow by 15.3% and 5.1% on average across understanding tasks, respectively. It also accelerates generation by 6× and 11× relative to these baselines, enabling single-image inference in roughly 3 seconds and marking the first real-time multimodal understanding and generation capability on mobile devices.

📝 Abstract
Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layer-wise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6× and 11× faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512×512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
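The abstract attributes MCP's low cost to depthwise-separable convolutions. The paper's exact layer shapes are not given here, so the sketch below only illustrates the general parameter-count argument, using hypothetical channel widths (256→256, 3×3 kernel): a standard convolution pays for a full kernel per input/output channel pair, while a depthwise-separable one splits this into a per-channel spatial filter plus a 1×1 channel mixer.

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: one k×k kernel per (input, output) channel pair.
    return c_in * c_out * k * k

def ds_conv_params(c_in, c_out, k):
    # Depthwise stage: one k×k spatial filter per input channel.
    depthwise = c_in * k * k
    # Pointwise stage: a 1×1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical widths for illustration only (not from the paper).
c_in, c_out, k = 256, 256, 3
std = conv_params(c_in, c_out, k)     # 589,824 parameters
ds = ds_conv_params(c_in, c_out, k)   # 67,840 parameters
print(f"standard: {std}, depthwise-separable: {ds}, ratio: {std / ds:.1f}x")
```

At these widths the factorization cuts parameters (and, proportionally, multiply-accumulates) by roughly 8.7×, which is the kind of saving that makes conditioning modules viable on phone-class hardware.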
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
multimodal generation
mobile deployment
edge devices
unified models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobile Conditioning Projector
unified multimodal model
on-device inference
diffusion generator
cross-modal conditioning