🤖 AI Summary
Diffusion-model image generation faces dual challenges: high computational overhead and leakage of sensitive information in user prompts. To address these, we propose a privacy-first cloud-edge collaborative generation framework. On the cloud side, a backbone denoising model processes a set of semantically equivalent candidate prompts whose sensitive attributes have been anonymized; on the edge side, a lightweight denoiser completes the remaining generation steps. Latent-variable transmission and time- and batch-level redundancy caching further accelerate inference. Crucially, the framework requires no trust in the cloud server and provides rigorous differential privacy guarantees while preserving semantic fidelity. Experiments across multiple benchmark datasets demonstrate that our method achieves generation quality comparable to full-cloud models, incurs less than 8% additional server overhead, and keeps edge-side latency controllable. To the best of our knowledge, this is the first approach to effectively balance practicality, efficiency, and strong privacy in diffusion-based image generation.
📝 Abstract
Diffusion models have gained significant popularity due to their remarkable capabilities in image generation, albeit at the cost of intensive computation requirements. Meanwhile, despite their widespread deployment in inference services such as Midjourney, concerns have arisen about the potential leakage of sensitive information in uploaded user prompts. Existing solutions either lack rigorous privacy guarantees or fail to strike an effective balance between utility and efficiency. To bridge this gap, we propose ObCLIP, a plug-and-play safeguard that enables oblivious cloud-device hybrid generation. By oblivious, we mean that each input prompt is transformed into a set of semantically similar candidate prompts that differ only in sensitive attributes (e.g., gender, ethnicity). The cloud server processes all candidate prompts without knowing which one is real, thus preventing any prompt leakage. To mitigate server cost, only a small portion of the denoising steps is performed on the large cloud model. The intermediate latents are then sent back to the client, which selects the targeted latent and completes the remaining denoising with a small device model. Additionally, we analyze and incorporate several cache-based accelerations that exploit temporal and batch redundancy, effectively reducing computation cost with minimal utility degradation. Extensive experiments across multiple datasets demonstrate that ObCLIP provides rigorous privacy and utility comparable to cloud models with slightly increased server cost.
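The hybrid protocol in the abstract can be sketched as a toy control flow. Everything below is illustrative rather than the paper's implementation: the function names, the `(prompt, steps_done)` stand-in for a latent, and the string-replacement anonymizer are all assumptions used to show the oblivious cloud-device split, not real denoising.

```python
import random

def anonymize_candidates(prompt, attribute, options):
    # Build k semantically similar prompts that differ only in the sensitive
    # attribute, so the cloud cannot tell which candidate is the real one.
    # (A toy string substitution; real attribute rewriting would be richer.)
    return [prompt.replace(attribute, opt) for opt in options]

def cloud_denoise(prompts, steps):
    # Stand-in for the large cloud model: run a small portion of the
    # denoising steps on every candidate and return one intermediate
    # "latent" per prompt. Here a latent is just a record, not a tensor.
    return [{"prompt": p, "steps_done": steps} for p in prompts]

def device_denoise(latent, total_steps):
    # Stand-in for the small on-device model: complete the remaining steps.
    latent["steps_done"] = total_steps
    return latent

def oblivious_generate(prompt, attribute, options,
                       cloud_steps=10, total_steps=50):
    candidates = anonymize_candidates(prompt, attribute, options)
    random.shuffle(candidates)  # hide the real prompt's position as well
    latents = cloud_denoise(candidates, cloud_steps)
    # Only the client knows the real prompt, so only it can select the
    # matching intermediate latent returned by the cloud.
    target = next(l for l in latents if l["prompt"] == prompt)
    return device_denoise(target, total_steps)
```

For example, `oblivious_generate("a portrait of a woman", "woman", ["woman", "man", "person"])` sends three indistinguishable candidates to the simulated cloud but finishes generation on-device for the real one. The server cost grows with the number of candidates, which is why the paper restricts the cloud to a small fraction of the denoising steps.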