FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

📅 2026-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Scientific compound figures often lack panel-level annotations: captions are missing or give only figure-level summaries, which hinders fine-grained understanding. This work proposes FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. A noise-aware gated fusion module adaptively filters token-level features to stabilize the detection query space, and training follows a staged strategy that combines supervised learning with reinforcement learning, using CLIP-based alignment and BERTScore-based semantic rewards. The authors also curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, together with cross-disciplinary test suites in physics and chemistry. FigEx2 reaches 0.726 mAP@0.5:0.95 on detection, outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.

📝 Abstract

Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves 0.726 mAP@0.5:0.95 for detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 transfers zero-shot to out-of-distribution scientific domains without any fine-tuning.
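
For readers unfamiliar with the detection metric, the "0.5:0.95" in mAP@0.5:0.95 conventionally refers to ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, with average precision averaged across them. The sketch below shows only the IoU computation and the threshold sweep for a single predicted panel box; it is not the paper's evaluation code, it does not compute full average precision, and the names `iou`, `IOU_THRESHOLDS`, and `thresholds_passed` are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box counts as a match."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


if __name__ == "__main__":
    panel_gt = (0, 0, 10, 10)      # hypothetical ground-truth panel
    panel_pred = (5, 0, 15, 10)    # hypothetical prediction, shifted right
    print(iou(panel_pred, panel_gt))
    print(thresholds_passed(panel_pred, panel_gt))
```

A prediction that overlaps its panel only loosely passes few or none of these thresholds, which is why the averaged metric is stricter than mAP at a single IoU of 0.5.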

Problem

Research questions and friction points this paper is trying to address.

scientific compound figures

panel detection

captioning

multimodal understanding

figure interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-conditioned panel detection

noise-aware gated fusion

staged optimization with RL
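
To give a rough intuition for the gated fusion idea listed above, here is a minimal sketch of a generic sigmoid gate that decides how much of each token feature is mixed into a detection query. The page does not give the module's equations, so everything here is an assumption for illustration: the scalar gate, the residual update, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and should not be read as the FigEx2 implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(query, token_feats, gate_weights, gate_bias=0.0):
    """Fuse token features into a query vector through per-token gates.

    query        : list of d floats (a detection query, assumed)
    token_feats  : list of token vectors, each with d floats
    gate_weights : list of 2*d floats scoring the pair [query; token]
    gate_bias    : scalar bias; a strongly negative value closes the gate
    """
    d = len(query)
    mixed = [0.0] * d
    for tok in token_feats:
        # Scalar gate computed from the concatenated query and token.
        z = sum(w * v for w, v in zip(gate_weights, query + tok)) + gate_bias
        g = sigmoid(z)
        # Tokens with a small gate contribute little to the mix.
        mixed = [m + g * t for m, t in zip(mixed, tok)]
    n = max(len(token_feats), 1)
    # Residual update: the query is kept and the gated mean is added.
    return [q + m / n for q, m in zip(query, mixed)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    tokens = [[2.0, 2.0], [0.0, 4.0]]
    print(gated_fusion(q, tokens, gate_weights=[0.0] * 4))
```

In this toy version, a gate near zero leaves the query essentially unchanged, which is one simple way a fusion step could avoid being perturbed by noisy caption tokens.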