Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose text-to-image models suffer from detail distortion, cultural feature omission, and low fidelity when generating Chinese cuisine imagery. To address these limitations, we propose the first diffusion-based generative model specifically designed for Chinese culinary visualization. Our method introduces four key innovations: (1) construction of the largest publicly available Chinese dish dataset, meticulously annotated by regional cuisine categories and refined via human re-captioning; (2) a coarse-to-fine two-stage training paradigm; (3) an LLM-driven high-quality prompt enhancement mechanism; and (4) a concept-augmented Prompt-to-Prompt local editing framework. Extensive experiments demonstrate that our model consistently outperforms state-of-the-art methods across fidelity, cultural accuracy, and editing controllability. Notably, it achieves significant improvements in multi-cuisine representation, complex plating composition, and fine-grained ingredient texture synthesis—enabling both photorealistic generation and semantically coherent local edits.

📝 Abstract
Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
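The abstract describes enhancing the user's textual input at inference time with a pre-constructed caption library and an LLM. A minimal sketch of that retrieve-then-rewrite flow is below; all names (`CAPTION_LIBRARY`, `retrieve`, `llm_rewrite`, `enhance_prompt`) are hypothetical, the similarity function is a toy token-overlap stand-in for a real embedding model, and `llm_rewrite` is a placeholder where a real system would call an LLM with the retrieved captions as style references.

```python
# Hypothetical sketch of retrieval-augmented prompt enhancement, not the
# paper's actual implementation. A toy caption library stands in for the
# pre-constructed high-quality caption library mentioned in the abstract.

CAPTION_LIBRARY = [
    "mapo tofu in a clay pot, glossy chili oil, scattered scallions",
    "peking duck, crisp amber skin, thin pancakes and hoisin on the side",
    "dim sum basket of har gow, translucent wrappers, bamboo steamer",
]

def similarity(a: str, b: str) -> float:
    """Toy token-overlap (Jaccard) similarity in place of a real text embedding."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve(prompt: str, k: int = 2) -> list[str]:
    """Pick the k library captions closest to the user's prompt."""
    return sorted(CAPTION_LIBRARY, key=lambda c: -similarity(prompt, c))[:k]

def llm_rewrite(prompt: str, examples: list[str]) -> str:
    """Placeholder for the LLM call: here we just borrow descriptive detail
    from the nearest retrieved caption to enrich the user's prompt."""
    detail = examples[0].split(", ", 1)[1] if ", " in examples[0] else ""
    return f"{prompt}, {detail}" if detail else prompt

def enhance_prompt(prompt: str) -> str:
    return llm_rewrite(prompt, retrieve(prompt))

print(enhance_prompt("mapo tofu in a clay pot"))
```

The enriched prompt would then be fed to the diffusion model in place of the raw user input.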
Problem

Research questions and friction points this paper is trying to address.

Generating photorealistic images of Chinese dishes
Capturing diverse characteristics of specific culinary domains
Enabling faithful dish image editing with specialized models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest Chinese dish dataset for training
Recaption strategy and coarse-to-fine training
Concept-Enhanced P2P for dish editing
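The editing contribution builds on Prompt-to-Prompt (P2P), in which cross-attention maps from the source prompt's denoising pass are re-injected while denoising with the edited prompt, preserving layout while changing only the swapped token's content. Below is a minimal NumPy sketch of that underlying mechanism; the scaled dot-product attention is standard, but the function names and toy shapes are illustrative stand-ins, not the paper's concept-enhanced variant.

```python
# Illustrative sketch of the basic P2P attention-injection idea that
# Concept-Enhanced P2P extends; not the paper's implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Standard scaled dot-product attention; returns output and the map."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys.T * scale)   # (pixels, tokens)
    return attn @ values, attn

def p2p_step(queries, src_keys, src_values, edit_values):
    """One edited step: keep the SOURCE prompt's attention map (layout),
    but read from the EDITED prompt's value vectors (the token swap)."""
    _, src_attn = cross_attention(queries, src_keys, src_values)
    return src_attn @ edit_values

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))     # 16 "pixel" queries, feature dim 8
k = rng.normal(size=(4, 8))      # 4 prompt tokens
v_src = rng.normal(size=(4, 8))
v_edit = v_src.copy()
v_edit[2] = rng.normal(size=8)   # swap one ingredient token's embedding

out = p2p_step(q, k, v_src, v_edit)
print(out.shape)                 # (16, 8)
```

Because the attention map is fixed from the source pass, the spatial composition of the dish is preserved and only the region attending to the swapped token changes, which is what makes local, semantically coherent edits possible.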
Huijie Liu (Meituan, China; Beihang University, China)
Bingcan Wang (Meituan, China)
Jie Hu (Meituan, China)
Xiaoming Wei (Meituan; computer vision, machine learning)
Guoliang Kang (Professor, Beihang University; deep learning and its applications)