FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scientific compound figures often lack panel-level annotations, providing only image-level summaries that hinder fine-grained understanding. To address this, the paper proposes FigEx2, a framework that jointly performs panel localization and caption generation through vision-conditioned guidance, optimized via a multi-stage strategy combining supervised and reinforcement learning. The approach introduces a noise-aware gated fusion module to stabilize the detection query space and constructs BioSci-Fig-Cap, a high-quality cross-disciplinary benchmark that enables evaluation of zero-shot cross-domain transfer. Built on CLIP alignment, BERTScore-based semantic rewards, and a vision-conditioned Transformer architecture, FigEx2 achieves 0.726 mAP@0.5:0.95 on detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore, demonstrating strong zero-shot generalization.

📝 Abstract
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
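The noise-aware gated fusion described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the sigmoid-gate parameterization, the dot-product agreement score, and the residual wiring below are all assumptions chosen to show the idea that each caption token contributes to the detection query in proportion to its agreement with the visual feature, so noisy or off-topic phrasing is suppressed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual_feat, token_feats, w=0.5, b=0.0):
    """Sketch of noise-aware gated fusion: weight each text token
    embedding by a gate measuring its agreement with the visual
    feature, then add the filtered text signal back onto the visual
    query. `w` and `b` stand in for learned gate parameters."""
    def dot(a, c):
        return sum(x * y for x, y in zip(a, c))

    fused = [0.0] * len(visual_feat)
    for tok in token_feats:
        # per-token reliability gate in (0, 1)
        gate = sigmoid(w * dot(visual_feat, tok) + b)
        for i in range(len(fused)):
            fused[i] += gate * tok[i]

    # residual connection keeps the query grounded in the visual feature
    n = max(len(token_feats), 1)
    return [v + f / n for v, f in zip(visual_feat, fused)]
```

In this sketch a token orthogonal to the visual feature receives a gate near 0.5 rather than 0, so a real module would likely learn a sharper, feature-wise gate; the scalar gate here is only the simplest possible instance.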
Problem

Research questions and friction points this paper is trying to address.

scientific compound figures
panel detection
captioning
multimodal understanding
figure interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-conditioned panel detection
noise-aware gated fusion
staged optimization with RL
multimodal consistency
zero-shot transferability
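The staged RL objective listed above can be sketched as a scalar reward blending CLIP-style image-caption alignment with BERTScore-style semantic similarity, combined with a self-critical baseline. The mixing weight `alpha` and the `scst_advantage` helper are illustrative assumptions, not the paper's published formulation; the scores are assumed to be precomputed by external models.

```python
def combined_reward(clip_sim, bertscore_f1, alpha=0.5):
    """Blend image-text alignment with caption semantic quality.
    clip_sim and bertscore_f1 are assumed precomputed scores in [0, 1];
    alpha is a hypothetical mixing weight."""
    return alpha * clip_sim + (1.0 - alpha) * bertscore_f1

def scst_advantage(sampled_reward, greedy_reward):
    """Self-critical baseline: reward of a sampled caption minus the
    reward of the greedy-decoded caption; positive values reinforce
    the sampled caption during RL fine-tuning."""
    return sampled_reward - greedy_reward
```

A self-critical baseline of this kind is a common choice for caption-level RL because it needs no learned value function; whether FigEx2 uses exactly this baseline is not stated in the abstract.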
Jifeng Song
Department of Electrical and Computer Engineering, University of Pittsburgh, USA
Arun Das
Postdoctoral Associate at University of Pittsburgh Medical Center (UPMC)
Artificial Intelligence · Spatial Transcriptomics · Computer Vision · Explainable AI · Distributed Systems
Pan Wang
Department of Electrical and Computer Engineering, University of Pittsburgh, USA
Hui Ji
Department of Informatics and Networked Systems, University of Pittsburgh, USA
Kun Zhao
University of Pittsburgh
NLP
Yufei Huang
University of Pittsburgh Medical Center
Spatial transcriptomics · mRNA methylation · Cancer genomics · AI for cancer biology