AI Summary
This work addresses two core challenges in article-level visual-text rendering (e.g., infographic and slide generation): (1) the difficulty of modeling long-range contextual dependencies, and (2) the scarcity of high-quality, commercially relevant multimodal data. To this end, we propose the first ultra-dense layout-guided cross-region attention mechanism, along with the large-scale commercial infographic dataset Infographics-650K and the benchmark BizEval. Methodologically, our approach integrates layer-wise retrieval-augmented generation, layout-aware cross-attention, region-conditioned classifier-free guidance (CFG) fine-tuning, and latent-space modeling of cropped regions. On BizEval, our method significantly outperforms state-of-the-art models including Flux and SD3, and ablation studies confirm the efficacy of each component. All code, datasets, and evaluation benchmarks are publicly released, providing a foundational resource for commercial-grade visual content generation research.
Abstract
Recently, state-of-the-art text-to-image generation models such as Flux and Ideogram 2.0 have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenario of article-level visual text rendering and address the novel task of generating high-quality business content, including infographics and slides, from user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works, which handle a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, Infographics-650K, equipped with ultra-dense layouts and prompts, built via a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross-attention scheme that injects tens of region-wise prompts into a set of cropped-region latent spaces according to the ultra-dense layouts and flexibly refines each sub-region during inference using layout-conditional classifier-free guidance (CFG). We demonstrate the strong results of our system compared to previous state-of-the-art systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope that Infographics-650K and BizEval will encourage the broader community to advance the progress of business content generation.
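The core idea of layout-guided cross-attention, restricting each spatial region of the image latent to attend only to its own region-wise prompt, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, tensor shapes, single-head attention, and the hard region-mask rule are all illustrative assumptions.

```python
import numpy as np

def layout_cross_attention(img_tokens, region_prompts, boxes, grid_hw):
    """Masked cross-attention sketch (assumed simplification): each spatial
    token attends only to the prompt tokens of the layout box containing it.

    img_tokens:     (H*W, d) latent image tokens on an H x W grid
    region_prompts: list of (L_i, d) prompt-token arrays, one per region
    boxes:          list of (x0, y0, x1, y1) in grid coordinates
    grid_hw:        (H, W)
    """
    H, W = grid_hw
    d = img_tokens.shape[1]
    keys = np.concatenate(region_prompts, axis=0)            # (L, d)
    # region id of every prompt token
    region_of_tok = np.concatenate(
        [np.full(p.shape[0], i) for i, p in enumerate(region_prompts)])
    # region id of every spatial location (-1 = outside any box)
    region_map = -np.ones((H, W), dtype=int)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        region_map[y0:y1, x0:x1] = i
    region_map = region_map.reshape(-1)                      # (H*W,)

    scores = img_tokens @ keys.T / np.sqrt(d)                # (H*W, L)
    mask = region_map[:, None] == region_of_tok[None, :]     # layout constraint
    scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=1, keepdims=True)              # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    out = attn @ keys                                        # (H*W, d)
    out[region_map < 0] = 0.0                                # background: no text signal
    return out
```

With constant prompt embeddings per region, each spatial token's output collapses to exactly its own region's embedding, which makes the masking behavior easy to verify by hand.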