AI Summary
This work addresses the limitations of existing document generation methods, which are constrained by the scale and diversity of human-curated datasets and therefore offer poor support for unified multi-category, multi-style generation while suffering from content overflow. To overcome these challenges, the authors propose a unified multitask document generation framework, introducing DocHTML, a large-scale synthetic HTML/CSS dataset encompassing 111 categories and 32 styles. The approach combines multimodal large language model fine-tuning with document rendering and derendering techniques, and introduces height-aware reinforcement learning (HARL), a mechanism that exploits the discrepancy between predicted and target document heights to optimize layout. Experimental results demonstrate that the proposed method significantly outperforms both general-purpose multimodal models and specialized baselines across three tasks, intention-to-document, document derendering, and element-to-document generation, achieving superior generation quality and enhanced layout control.
Abstract
Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline that automatically generates documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset of 265,206 document samples spanning 111 categories and 32 distinct styles. All documents come with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multimodal large language models (MLLMs) to support three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. HARL defines a reward based on the difference between the predicted and target document heights, so overflow is penalized and gradually mitigated during post-training, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
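The abstract does not give the exact form of the HARL reward, only that it is derived from the difference between predicted and target document heights so that overflow is penalized. A minimal illustrative sketch, assuming both documents have been rendered and measured in pixels (the function name, tolerance parameter, and linear decay are assumptions, not the paper's formula):

```python
def height_reward(pred_height: float, target_height: float, tol: float = 0.05) -> float:
    """Toy height-aware reward in [0, 1].

    Returns 1.0 when the predicted document's rendered height matches the
    target within a relative tolerance `tol`, and decays linearly toward 0
    as the prediction over- or under-shoots the target (overflow is thus
    penalized). This is a hypothetical sketch, not AnyDoc's actual reward.
    """
    if target_height <= 0:
        return 0.0
    rel_diff = abs(pred_height - target_height) / target_height
    if rel_diff <= tol:
        return 1.0
    return max(0.0, 1.0 - rel_diff)
```

In a post-training loop, such a reward would be computed per sample by rendering the model's generated HTML/CSS, measuring the page height, and comparing it against the ground-truth document's height.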