🤖 AI Summary
Existing vision–language datasets suffer from limitations in scale, depth of knowledge, and image–text alignment, hindering the application of multimodal AI in visual design. This work proposes Feynman, a scalable diagram-generation agent that uniquely integrates knowledge-driven code planning with optimized rendering. By combining domain-knowledge enumeration, declarative program synthesis, Penrose-based rendering, and feedback-driven iterative refinement, Feynman automatically produces high-quality image–text pairs that are semantically consistent yet exhibit diverse layouts. The system synthesizes over 100,000 such pairs, enabling the creation of Diagramma—the first visual reasoning benchmark built entirely on synthetic data. The authors will release the full dataset, benchmark, and agent pipeline to support future research in multimodal visual understanding and generation.
📝 Abstract
Visual design is an essential application of state-of-the-art multimodal AI systems. Improving these systems requires high-quality vision-language data at scale, yet despite the abundance of image and text data on the internet, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based on them. Given the plan, Feynman translates the ideas into simple declarative programs and iterates to receive feedback and visually refine the diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. Penrose's optimization-based rendering preserves the visual semantics while injecting fresh randomness into the layout, producing diagrams that are both visually consistent and diverse. As a result, Feynman can author diagrams with grounded captions at very little cost and in very little time. Using Feynman, we synthesized a dataset of more than 100k well-aligned diagram-caption pairs. We also curate a vision-language benchmark, Diagramma, from freshly generated data; Diagramma can be used to evaluate the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.
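The abstract describes a four-stage loop: idea enumeration, code planning and program synthesis, feedback-driven refinement, and Penrose rendering. A minimal sketch of that control flow is below. This is not the authors' released pipeline: every function here (`enumerate_ideas`, `plan_and_synthesize`, `critique`, `refine`) is a hypothetical stub standing in for an LLM call or a Penrose render-and-inspect step, and only the loop structure reflects the paper's description.

```python
from dataclasses import dataclass

@dataclass
class DiagramPair:
    program: str   # declarative Penrose-style program (stub text here)
    caption: str   # grounded caption derived from the originating idea

def enumerate_ideas(domain: str) -> list[str]:
    # Stub: in the described agent, an LLM enumerates domain-specific
    # knowledge components ("ideas"). Here we fabricate three placeholders.
    return [f"{domain}: idea {i}" for i in range(3)]

def plan_and_synthesize(idea: str) -> str:
    # Stub for code planning + translation into a declarative program.
    return f"-- program for {idea}"

def critique(program: str) -> str:
    # Stub feedback step: the real agent would render the program with
    # Penrose and inspect the diagram; here we use a trivial heuristic.
    return "ok" if "(refined)" in program else "needs refinement"

def refine(program: str, feedback: str) -> str:
    # Stub visual refinement based on the feedback.
    return program + " (refined)"

def feynman(domain: str, max_iters: int = 3) -> list[DiagramPair]:
    """Sketch of the enumerate -> plan -> refine -> render loop."""
    pairs = []
    for idea in enumerate_ideas(domain):
        prog = plan_and_synthesize(idea)
        for _ in range(max_iters):
            feedback = critique(prog)
            if feedback == "ok":
                break
            prog = refine(prog, feedback)
        # Final rendering by Penrose (omitted) would re-sample the layout,
        # giving visual diversity while preserving the program's semantics.
        pairs.append(DiagramPair(program=prog, caption=idea))
    return pairs
```

Because the layout is solved by Penrose's optimizer rather than fixed by the program, re-rendering the same declarative program yields varied layouts with the same semantics, which is the diversity mechanism the abstract highlights.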