SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

📅 2025-07-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Graphical abstracts (GAs) visually convey a paper's key findings, but designing effective ones requires advanced visualization skills, and their potential to enhance scientific communication remains largely unexplored. To address this, we introduce SciGA-145k, a large-scale dataset of approximately 145,000 scientific papers and 1.14 million figures, built to support GA selection and recommendation and to facilitate research on automated GA generation. We define two tasks: intra-GA recommendation, which identifies figures within a paper that are well suited to serve as its GA, and inter-GA recommendation, which retrieves GAs from other papers to inspire new ones. We provide baseline models for both tasks. We also propose the Confidence Adjusted top-1 ground truth Ratio (CAR), a recommendation metric that addresses a limitation of traditional ranking-based metrics by accounting for papers in which figures other than the explicitly labeled GA could also serve as GAs. Together, the dataset, tasks, and metric lay a foundation for research on visual scientific communication and AI for Science.

📝 Abstract

Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.
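The two tasks defined in the abstract can be framed as ranking problems. The sketch below is only an illustration of that framing, not the paper's baseline models: the `Paper` class, the precomputed embedding vectors, the cosine-similarity scoring, and every function name are assumptions introduced here. It also computes a conventional Recall@k, the kind of ranking-based metric that CAR is meant to complement; CAR itself is not implemented because its formula is not given in this summary.

```python
import math
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one paper; the field names are assumptions."""
    paper_id: str
    abstract_vec: list[float]             # assumed embedding of the abstract
    figure_vecs: dict[str, list[float]]   # figure id -> assumed figure embedding
    ga_figure: str                        # id of the figure labeled as the GA


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_ga_rank(paper: Paper) -> list[str]:
    """Intra-GA: rank a paper's own figures by similarity to its abstract."""
    return sorted(
        paper.figure_vecs,
        key=lambda fig: cosine(paper.abstract_vec, paper.figure_vecs[fig]),
        reverse=True,
    )


def inter_ga_rank(query: Paper, corpus: list[Paper]) -> list[str]:
    """Inter-GA: rank other papers by how well their labeled GA matches the query."""
    others = [p for p in corpus if p.paper_id != query.paper_id]
    ranked = sorted(
        others,
        key=lambda p: cosine(query.abstract_vec, p.figure_vecs[p.ga_figure]),
        reverse=True,
    )
    return [p.paper_id for p in ranked]


def recall_at_k(papers: list[Paper], k: int) -> float:
    """Fraction of papers whose labeled GA appears in the top-k intra-GA ranking."""
    hits = sum(p.ga_figure in intra_ga_rank(p)[:k] for p in papers)
    return hits / len(papers)
```

Recall@k counts a paper as a miss whenever the labeled GA falls outside the top k, even if the top-ranked figure would itself make a reasonable GA. That is the case the abstract says CAR is designed to handle.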

Problem

Research questions and friction points this paper is trying to address.

Exploring the untapped potential of Graphical Abstracts in scientific communication

Overcoming visualization skill barriers to effective GA design

Automating GA selection and generation using a large-scale dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset SciGA-145k for GA research

Intra-GA and Inter-GA recommendation tasks defined

Novel metric CAR for fine-grained model analysis
