🤖 AI Summary
This study systematically evaluates large multimodal models (LMMs) on cross-disciplinary scientific figure captioning, a challenging task that requires precise technical understanding, causal reasoning, and familiarity with domain-specific conventions.
Method: Using the SCICAP Challenge 2023 benchmark, a large-scale, multi-domain dataset, we conduct end-to-end captioning evaluation and attribution analysis of state-of-the-art models (e.g., GPT-4V). Crucially, we introduce a blind evaluation protocol with professional editors to overcome the limitations of conventional automatic metrics.
Contribution/Results: In human preference studies, professional editors preferred GPT-4V's captions over both the original author-written captions and those of all baseline models, demonstrating meaningful progress in scientific figure comprehension and expression. However, fine-grained analysis reveals persistent, systematic weaknesses in technical accuracy, causal inference, and adherence to disciplinary norms. Our work validates the promise of LMMs for scientific image understanding while contending that the task remains fundamentally unsolved, and it provides a reproducible benchmark and diagnostic framework to guide future research.
📝 Abstract
Since the SCICAP dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SCICAP Challenge took place, inviting teams worldwide to use an expanded SCICAP dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced rapidly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities on a variety of vision-and-language tasks. This paper presents an overview of the first SCICAP Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by the authors. Following this key finding, we conducted detailed analyses to answer the question: Have advanced LMMs solved the task of generating captions for scientific figures?