Home-made Diffusion Model from Scratch to Hatch

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training high-fidelity text-to-image diffusion models on consumer-grade hardware remains computationally prohibitive. To address this, we propose HDM—a lightweight diffusion model optimized for resource-constrained settings. Methodologically, HDM introduces three key innovations: (1) the Cross-U-Transformer (XUT) architecture, which enhances multi-scale feature fusion in U-Net via cross-attention, enabling compact models to acquire advanced capabilities such as camera control; (2) the TREAD training acceleration framework; (3) shift-square cropping—supporting arbitrary aspect ratios—and progressive resolution scaling, collectively boosting training efficiency and 1024×1024 image fidelity. Evaluated on just four RTX 5090 GPUs at a total cost of $535–$620, HDM achieves generation quality competitive with state-of-the-art large-scale models. This work significantly lowers the barrier to entry for high-quality text-to-image synthesis, enabling broader adoption by individual researchers and small institutions.

📝 Abstract
We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535–$620 using four RTX 5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.
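The abstract's central architectural idea is replacing the concatenation-based skip connections of a classic U-Net with cross-attention, so decoder tokens query the matching encoder level. A minimal numpy sketch of that fusion step, with identity projections standing in for the learned Wq/Wk/Wv matrices (the function name and shapes are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_skip(decoder_tokens, encoder_skip):
    """Fuse encoder skip features into decoder tokens via cross-attention,
    instead of the channel-wise concatenation used in a classic U-Net.
    decoder_tokens: (n_dec, d) queries from the up path
    encoder_skip:   (n_enc, d) keys/values from the matching down-path level
    Identity projections here are illustrative; a real model learns them."""
    d = decoder_tokens.shape[-1]
    q, k, v = decoder_tokens, encoder_skip, encoder_skip
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_dec, n_enc) attention weights
    fused = attn @ v                       # (n_dec, d) attended skip features
    return decoder_tokens + fused          # residual fusion

rng = np.random.default_rng(0)
dec = rng.normal(size=(16, 64))   # 16 decoder tokens, dim 64
enc = rng.normal(size=(64, 64))   # 64 skip tokens from the encoder
out = cross_attention_skip(dec, enc)
print(out.shape)  # (16, 64)
```

Because the decoder attends to, rather than concatenates, the skip features, the fusion is content-dependent, which is plausibly what drives the compositional consistency the authors report.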
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost for text-to-image diffusion models
Achieving high-quality image generation on consumer hardware
Democratizing AI image synthesis for limited-resource researchers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-U-Transformer with cross-attention skip connections
TREAD acceleration and progressive resolution scaling
Small 343M parameter model achieving high-quality generation
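One of the listed training-recipe innovations, the shifted square crop for arbitrary aspect ratios, can be read as taking a square crop whose position shifts along the image's longer axis, keeping the offset available as conditioning. This is a guess at the mechanism from the name alone; the paper's actual strategy may differ:

```python
import numpy as np

def shifted_square_crop(img, rng):
    """Illustrative take on square cropping for arbitrary-aspect-ratio training:
    crop a square of side min(H, W), shifted randomly along the longer axis,
    and return the crop plus its offset (a possible conditioning signal).
    Hypothetical reconstruction of the paper's 'shifted square crop'."""
    h, w = img.shape[:2]
    side = min(h, w)
    y = rng.integers(0, h - side + 1)  # shift range collapses to 0 on the short axis
    x = rng.integers(0, w - side + 1)
    return img[y:y + side, x:x + side], (y, x)

rng = np.random.default_rng(1)
img = rng.random((768, 1280, 3))   # landscape image, 768x1280
crop, (y, x) = shifted_square_crop(img, rng)
print(crop.shape)  # (768, 768, 3)
```

Training on shifted square windows lets every batch stay at a fixed square resolution while the dataset keeps its native aspect ratios, avoiding the distortion of naive resizing.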