Factuality Matters: When Image Generation and Editing Meet Structured Visuals

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visual generation models excel at synthesizing natural images but struggle with structured visual content—such as charts and diagrams—due to challenges in compositional planning, poor text rendering, and low factual accuracy. This work presents the first systematic study of structured visual generation and editing. We propose: (1) a large-scale dataset of 1.3 million high-quality image-program pairs, constructed from executable drawing programs; (2) a lightweight multimodal architecture integrating a vision-language model (VLM) with FLUX.1 Kontext, enhanced by chain-of-thought program annotations and a three-stage training paradigm; and (3) an inference-time external augmentation mechanism. We concurrently release StructBench—a benchmark with 1,700+ challenging instances—and StructScore, a fine-grained evaluation metric. Experiments demonstrate substantial improvements in editing performance, with external reasoning augmentation proving broadly effective. All code, data, and models are publicly released.
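The dataset construction above pairs executable drawing programs with the images they render. As a toy illustration of that pairing (the paper's actual program language and rendering pipeline are not specified here; the SVG-emitting script and `OUT_PATH` convention below are assumptions), an image-program pair might be built like this:

```python
# Hypothetical sketch: build one image-program pair by executing a
# self-contained drawing program that writes an SVG file.

def render_program(program: str, out_path: str) -> str:
    """Execute a drawing program and return the path of the image it wrote."""
    exec(program, {"OUT_PATH": out_path})  # program receives its output path
    return out_path

# A minimal "drawing program": a bar chart written directly as SVG markup.
bar_chart_program = """
bars = [("A", 30), ("B", 70), ("C", 50)]
rects = "".join(
    f'<rect x="{i * 40}" y="{100 - h}" width="30" height="{h}" fill="steelblue"/>'
    for i, (_, h) in enumerate(bars)
)
svg = f'<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">{rects}</svg>'
with open(OUT_PATH, "w") as f:
    f.write(svg)
"""

# The pair keeps the program as ground-truth structure alongside the image.
pair = {
    "program": bar_chart_program,
    "image": render_program(bar_chart_program, "bar_chart.svg"),
}
```

Because the program is executable, every visual element in the image has a verifiable source of truth, which is what makes fine-grained factual evaluation possible downstream.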

📝 Abstract
While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
Problem

Research questions and friction points this paper is trying to address.

Addressing poor generation of structured visuals like charts and diagrams
Improving factual fidelity through multimodal reasoning and text rendering
Developing evaluation methods for factual accuracy in visual generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed a large-scale dataset with chain-of-thought annotations
Integrated a VLM with FLUX.1 Kontext via a lightweight connector
Introduced the StructBench benchmark with multi-round Q&A evaluation
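The multi-round Q&A protocol behind StructScore decomposes factual accuracy into per-question checks. A minimal sketch of that idea follows, assuming the score is the fraction of reference-matching answers (the paper's exact aggregation is not given here, and `answer_fn` stands in for a VLM answering questions about the generated image):

```python
# Hedged sketch of a StructScore-style factuality metric: fine-grained
# questions with reference answers, answered by a stand-in model callable.
from typing import Callable


def factuality_score(
    questions_and_refs: list[tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of questions whose answers match the references."""
    if not questions_and_refs:
        return 0.0
    correct = sum(
        answer_fn(q).strip().lower() == ref.strip().lower()
        for q, ref in questions_and_refs
    )
    return correct / len(questions_and_refs)


# Toy usage: reference answers would come from the source drawing program.
qa = [
    ("What is the height of bar B?", "70"),
    ("How many bars are shown?", "3"),
]
mock_vlm = {
    "What is the height of bar B?": "70",
    "How many bars are shown?": "4",  # one deliberate factual error
}
score = factuality_score(qa, lambda q: mock_vlm.get(q, ""))
print(score)  # 0.5
```

Scoring each atomic fact separately, rather than judging the whole image at once, is what lets the metric localize exactly which structured elements a model got wrong.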