AI Summary
This work addresses the limitations of existing document generation methods, which are constrained by the scale and diversity of human-curated datasets and therefore offer poor support for unified multi-category, multi-style generation while suffering from content overflow. To overcome these challenges, the authors propose a unified multitask document generation framework, introducing DocHTML, a large-scale synthetic HTML/CSS dataset encompassing 111 categories and 32 styles. The approach combines multimodal large language model fine-tuning with document rendering and derendering techniques, and introduces height-aware reinforcement learning (HARL), a mechanism that exploits the discrepancy between predicted and target document heights to optimize layout. Experimental results demonstrate that the proposed method significantly outperforms both general-purpose multimodal models and specialized baselines across three tasks, intention-to-document, document derendering, and element-to-document generation, achieving superior generation quality and enhanced layout control.
Abstract
Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline that automatically generates documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset of 265,206 document samples spanning 111 categories and 32 distinct styles. All documents come with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multimodal large language models (MLLMs) to support three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. HARL defines a reward based on the difference between the predicted and target document heights, so overflow is penalized and gradually mitigated during post-training, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
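The abstract does not give the exact form of the HARL reward, only that it is derived from the difference between predicted and target document heights so that overflow is penalized. A minimal illustrative sketch, assuming both documents have been rendered and measured in pixels (the function name, tolerance parameter, and linear decay are assumptions, not the paper's formula):

```python
def height_reward(pred_height: float, target_height: float, tol: float = 0.05) -> float:
    """Toy height-aware reward in [0, 1].

    Returns 1.0 when the predicted document's rendered height matches the
    target within a relative tolerance `tol`, and decays linearly toward 0
    as the prediction over- or under-shoots the target (overflow is thus
    penalized). This is a hypothetical sketch, not AnyDoc's actual reward.
    """
    if target_height <= 0:
        return 0.0
    rel_diff = abs(pred_height - target_height) / target_height
    if rel_diff <= tol:
        return 1.0
    return max(0.0, 1.0 - rel_diff)
```

In a post-training loop, such a reward would be computed per sample by rendering the model's generated HTML/CSS, measuring the page height, and comparing it against the ground-truth document's height.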