DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

📅 2026-02-28

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work targets a gap in autonomous "AI scientist" systems: they can draft manuscripts and run code, but they rarely produce publication-grade schematic diagrams such as teaser figures. The authors introduce DiagramBank, a dataset of 89,422 schematic diagrams curated from top-tier scientific publications. An automated pipeline extracts figures together with their in-text references and applies a CLIP-based filter to separate schematic diagrams from standard plots and natural images. Each diagram is paired with its paper abstract, caption, and figure-reference pairs, which supports retrieval at several query granularities. The dataset is released in a ready-to-index format alongside a retrieval-augmented generation codebase that demonstrates exemplar-conditioned synthesis of teaser figures.

📝 Abstract

Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and code, and to execute that code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline, which extracts figures and their corresponding in-text references and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context, from the abstract and caption to figure-reference pairs, enabling information retrieval at different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.
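
The idea of retrieving exemplars at different query granularities can be illustrated with a small sketch. This is not the DiagramBank codebase: the records, the field names (`abstract`, `caption`, `references`), and the `retrieve` helper are hypothetical, and a bag-of-words cosine similarity stands in for whatever learned embedding the released retrieval code uses. It only shows how the same query interface might be pointed at different context fields.

```python
import math
import re
from collections import Counter

# Hypothetical records. The field names mirror the context types named in the
# abstract, but the actual DiagramBank schema may differ.
RECORDS = [
    {
        "id": "fig-001",
        "abstract": "We propose a transformer encoder for protein structure prediction.",
        "caption": "Overview of the encoder-decoder architecture with attention blocks.",
        "references": "Figure 1 shows the overall pipeline of our model.",
    },
    {
        "id": "fig-002",
        "abstract": "A reinforcement learning agent for robotic grasping.",
        "caption": "Training loop: the policy network interacts with the simulator.",
        "references": "As illustrated in Figure 2, rewards are fed back to the policy.",
    },
]


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, field, records=RECORDS, top_k=1):
    """Rank records by similarity between the query and one context field."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(r[field])), r["id"]) for r in records]
    # Highest score first; ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored[:top_k]]
```

Calling `retrieve("robotic grasping agent", "abstract")` matches at the paper level, while `retrieve("attention architecture overview", "caption")` matches at the figure level. In a real setup the lexical scorer would presumably be replaced by text or multimodal embeddings indexed ahead of time.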

Problem

Research questions and friction points this paper is trying to address.

scientific schematic diagrams

document context

figure corpora

dataset curation

scientific document understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

schematic diagrams

quality-audited dataset

document context