🤖 AI Summary
This work addresses three core challenges in editable multi-page document generation: semantic alignment between background and text, cross-page visual coherence, and guaranteed text readability. We propose a training-free, text-driven multi-page background generation method. Methodologically, we introduce (i) a latent-space soft masking mechanism coupled with Automated Readability Optimization (ARO), which dynamically controls background shape, transparency, and local contrast via WCAG 2.2–informed perceptual contrast modeling and smooth barrier functions; (ii) recursive context guidance and multi-page summary-to-instruction distillation to ensure thematic consistency across pages; and (iii) a hierarchical document representation—decoupling text, graphics, and background—to enable prompt-driven stylistic customization. Experiments demonstrate that our method generates multi-page documents exhibiting strong visual coherence, precise semantic alignment, and fine-grained style control, while strictly preserving high text readability, making it directly applicable to real-world design workflows.
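The latent-space soft masking idea can be illustrated with a short sketch. The paper does not specify the exact barrier function, so the snippet below is a minimal illustration under stated assumptions: a logistic sigmoid plays the role of the smooth barrier, and `text_mask` is assumed to be a signed field that is positive inside text regions and negative outside (both the function name and the mask convention are hypothetical, not from the paper). The mask smoothly attenuates the denoising update toward zero over text regions while leaving the background free to change.

```python
import numpy as np

def soft_mask_update(latent, update, text_mask, sharpness=10.0):
    """Blend a denoising update into the latent, softly attenuating it in text regions.

    latent, update: arrays of the same shape (diffusion latents and their update).
    text_mask: signed field, positive inside text regions, negative outside
               (a hypothetical convention for this sketch).
    sharpness: controls how abruptly the barrier transitions at region edges.
    """
    # Smooth barrier: weight -> 0 deep inside text regions, -> 1 far outside,
    # with a differentiable transition in between (no hard binary cutoff).
    weight = 1.0 / (1.0 + np.exp(sharpness * text_mask))
    return latent + weight * update
```

Because the weight varies smoothly rather than switching on and off, the edited background blends into protected text regions without visible seams.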
📝 Abstract
We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a *latent masking* formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce *Automated Readability Optimization (ARO)*, which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
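The core of ARO's minimal-opacity search can be sketched concretely. The WCAG 2.2 contrast ratio and sRGB relative-luminance formulas below are taken directly from the standard; everything else (function names, the per-channel alpha compositing, the linear scan over opacity) is an illustrative assumption about how such a search could be implemented, not the paper's actual code.

```python
def _linearize(c):
    """sRGB channel (0-255) -> linear-light value, per WCAG 2.x."""
    c = c / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb):
    """Relative luminance of an sRGB color, per WCAG 2.x."""
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(rgb1, rgb2):
    """WCAG contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), in [1, 21]."""
    hi, lo = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

def minimal_backing_opacity(text_rgb, backing_rgb, bg_rgb, target=4.5, step=0.01):
    """Smallest alpha so that text over (alpha*backing + (1-alpha)*background)
    meets the target contrast ratio (4.5:1 is the WCAG AA threshold for normal text)."""
    alpha = 0.0
    while alpha <= 1.0:
        # Alpha-composite the backing shape over the generated background.
        composed = tuple(alpha * s + (1 - alpha) * b
                         for s, b in zip(backing_rgb, bg_rgb))
        if contrast_ratio(text_rgb, composed) >= target:
            return alpha
        alpha += step
    return 1.0  # fall back to a fully opaque backing shape
```

For example, black text over a dark background fails WCAG AA on its own, but `minimal_backing_opacity((0, 0, 0), (255, 255, 255), (40, 40, 40))` returns the lightest semi-transparent white backing that restores a 4.5:1 ratio, which keeps the shape as unobtrusive as readability allows.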