DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing on-device diffusion models suffer from large parameter counts, high latency, and a predominant focus on text-to-image generation, lacking unified capabilities for both image generation and editing. This work proposes DreamLite—the first lightweight, unified diffusion model (only 0.39B parameters) that supports both tasks—leveraging a latent-space context concatenation mechanism to unify generation and editing within a single network. The approach integrates a pruned mobile-friendly U-Net backbone, task-progressive joint pretraining, supervised fine-tuning, reinforcement learning, and a 4-step distillation-based denoising strategy. On a Xiaomi 14 device, DreamLite generates or edits 1024×1024 images in under one second, achieving a GenEval score of 0.72 and an ImgEdit score of 4.11, outperforming existing on-device models and rivaling select server-side counterparts.
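The summary mentions a 4-step distillation-based denoising strategy. As a rough illustration of what few-step sampling looks like after step distillation, here is a toy deterministic sampler; the `eps_model` interface, the linear alpha-bar schedule, and the DDIM-style update are all illustrative assumptions, not DreamLite's actual sampler.

```python
import numpy as np

def distilled_sample(eps_model, shape, steps=4, seed=0):
    """Toy sketch of few-step sampling after step distillation.

    `eps_model(x, t)` is an assumed noise-prediction network; the linear
    alpha-bar schedule and the deterministic DDIM-style update below are
    illustrative stand-ins, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape).astype(np.float32)  # start from pure noise
    ts = np.linspace(1.0, 0.0, steps + 1)              # t=1: noise, t=0: clean
    alpha = lambda t: 1.0 - 0.999 * t                  # toy alpha-bar schedule
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = alpha(t_cur), alpha(t_next)
        eps = eps_model(x, t_cur)
        # Estimate the clean latent, then step to the next noise level.
        x0 = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps
    return x
```

With only 4 network evaluations, a loop of this shape is what makes sub-second 1024×1024 generation plausible on a phone-class NPU.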
📝 Abstract
Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B parameters) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space: images are concatenated horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality supervised fine-tuning (SFT) and reinforcement learning, DreamLite achieves a GenEval score of 0.72 for image generation and an ImgEdit score of 4.11 for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce the denoising process to just 4 steps, enabling DreamLite to generate or edit a 1024×1024 image in under one second on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model to support both image generation and image editing.
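The unifying trick described above, horizontal concatenation in latent space with (target | blank) for generation and (target | source) for editing, can be sketched as follows. The function name, shapes, and the use of a zero tensor for the blank slot are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def build_context_latent(target, source=None):
    """Sketch of in-context spatial concatenation in latent space.

    Latents are concatenated horizontally along the width axis:
    (target | blank) for T2I generation, (target | source) for editing.
    Names and shapes are illustrative, not DreamLite's implementation.
    """
    if source is None:
        # Generation: pair the target latent with a blank (zero) latent.
        source = np.zeros_like(target)
    assert source.shape == target.shape, "latents must share one shape"
    # Concatenate along width: (C, H, W) -> (C, H, 2W).
    return np.concatenate([target, source], axis=-1)

# Toy 4-channel 128x128 latent (e.g. a 1024x1024 image at 8x downsampling).
target = np.random.randn(4, 128, 128).astype(np.float32)
gen_input = build_context_latent(target)                      # (target | blank)
edit_input = build_context_latent(target, target.copy())      # (target | source)
print(gen_input.shape)  # (4, 128, 256)
```

Because both tasks share one doubled-width input layout, a single U-Net can serve generation and editing without task-specific branches.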
Problem

Research questions and friction points this paper is trying to address.

on-device diffusion models
text-to-image generation
text-guided image editing
model unification
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-device diffusion model
unified image generation and editing
task-progressive pretraining
step distillation
mobile U-Net