🤖 AI Summary
Scientific composite figures often lack panel-level annotations, providing only image-level summaries that hinder fine-grained understanding. To address this, this work proposes the FigEx2 framework, which jointly performs panel localization and caption generation through vision-conditioned guidance, optimized via a multi-stage strategy combining supervised and reinforcement learning. The approach introduces a novel noise-aware gated fusion module to stabilize the detection query space and constructs BioSci-Fig-Cap, a high-quality cross-disciplinary benchmark dataset enabling zero-shot cross-domain transfer. Built upon CLIP alignment, BERTScore-based semantic rewards, and a vision-conditioned Transformer architecture, FigEx2 achieves 0.726 mAP@0.5:0.95 on detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore, demonstrating exceptional zero-shot generalization capability.
📝 Abstract
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a vision-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy that combines supervised learning with reinforcement learning (RL), using CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results show that FigEx2 achieves 0.726 mAP@0.5:0.95 for detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 transfers zero-shot to out-of-distribution scientific domains without any fine-tuning.
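For readers unfamiliar with the detection metric, the "0.5:0.95" suffix is commonly read in the COCO style: average precision is computed at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. The sketch below is illustrative only and assumes that convention; the helper names (`iou`, `thresholds_passed`) and the `(x1, y1, x2, y2)` box format are hypothetical and not taken from the paper. It shows only the IoU computation and threshold sweep for a single predicted panel box, not a full mAP evaluation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which the prediction counts as a match."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A predicted panel box that overlaps its ground-truth box with IoU 0.82 would count as a correct detection at the thresholds 0.50 through 0.80 but not at 0.85 and above, which is why this metric rewards tight localization more than mAP@0.5 alone.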