🤖 AI Summary
This work addresses the challenge of fine-grained, layer-aware natural-language editing of multi-layer design documents such as posters, a task hindered by the lack of joint reasoning over *what* to edit and *where* to apply changes. To this end, we propose MiLDEAgent, the first framework equipped with a reasoning mechanism tailored to multi-layer documents, integrating a multimodal reasoning module trained via reinforcement learning with a layer-aware editor to enable precise, targeted modifications. Our contributions include MiLDEBench, a human-AI collaborative benchmark comprising over 20,000 samples, and MiLDEEval, a four-dimensional automatic evaluation protocol; together they establish a comprehensive data construction and evaluation pipeline. Experiments demonstrate that MiLDEAgent significantly outperforms open-source baselines and approaches the performance of closed-source models in instruction following, layout consistency, aesthetics, and text rendering, establishing the first strong baseline for this task.
📝 Abstract
Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify the relevant layers and coordinate modifications across them. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, both of which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.