🤖 AI Summary
Existing biomedical vision-language pretraining primarily relies on coarse-grained image-text pairing, neglecting fine-grained, region-level structural alignment essential for clinical interpretation.
Method: We propose Panel2Patch, a novel data pipeline that decomposes multi-panel, annotated biomedical figures into three hierarchical supervision granularities (figure, panel, and patch), explicitly modeling cross-level image-text correspondences. Our approach integrates layout parsing with visual marker recognition and introduces a hierarchical image-text alignment strategy incorporating cross-granularity contrastive learning and joint optimization.
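The summary does not specify the training objective in detail; a minimal sketch of what cross-granularity contrastive learning with joint optimization could look like, assuming one batch of matched image/text embeddings per granularity and hypothetical per-level weights (the function names and weights here are illustrative, not from the paper):

```python
import numpy as np

def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    img, txt: (B, D) arrays where row i of img matches row i of txt.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); diagonal = positive pairs

    def nll(l):
        # Numerically stable log-softmax along each row, then take the
        # negative log-probability of the diagonal (matched) entries.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (nll(logits) + nll(logits.T))

def hierarchical_loss(pairs, weights=None):
    """Joint objective: weighted sum of per-granularity contrastive losses.

    pairs: {"figure": (img, txt), "panel": (...), "patch": (...)}.
    The weights below are placeholders, not values from the paper.
    """
    if weights is None:
        weights = {"figure": 1.0, "panel": 1.0, "patch": 0.5}
    return sum(w * info_nce(*pairs[g]) for g, w in weights.items() if g in pairs)
```

Under this reading, "joint optimization" simply means the figure-, panel-, and patch-level losses share one backward pass, so coarse and fine supervision shape the same encoder.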
Contribution/Results: Panel2Patch generates high-information-density supervision signals from only a small set of literature figures, substantially reducing pretraining data requirements. It significantly improves model performance on fine-grained understanding, region localization, and downstream tasks, including figure question answering and lesion description generation, establishing a new paradigm for multi-granularity biomedical vision-language modeling.
📄 Abstract
There is growing interest in developing strong biomedical vision-language models. A popular approach to achieving robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts it into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
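The abstract describes turning one figure-caption pair into figure-, panel-, and patch-level pairs. A toy sketch of the coarser two levels, assuming a layout parser has already produced per-panel bounding boxes and that sub-captions are keyed by "(a)"/"(b)" markers (all names and the marker convention are assumptions for illustration; real captions and patch-level marker detection are considerably messier):

```python
import re

def split_caption(caption):
    """Split a multi-panel caption on (a)/(b)/... markers into sub-captions.

    Returns the figure-level preamble and a {label: sub-caption} dict.
    """
    parts = re.split(r"\(([a-z])\)\s*", caption)
    figure_text = parts[0].strip()
    subs = {
        parts[i]: parts[i + 1].strip().rstrip(".,;")
        for i in range(1, len(parts) - 1, 2)
    }
    return figure_text, subs

def build_pairs(figure_id, panel_boxes, caption):
    """Construct figure- and panel-level image-text pairs.

    panel_boxes: {label: (x, y, w, h)} assumed to come from a layout parser.
    Patch-level pairs would additionally need visual-marker detection
    (arrows, asterisks) inside each panel; they are omitted here.
    """
    figure_text, subs = split_caption(caption)
    pairs = [("figure", figure_id, caption)]  # whole figure + full caption
    for label, box in panel_boxes.items():
        if label in subs:  # crop region paired with its own sub-caption
            pairs.append(("panel", (figure_id, box), subs[label]))
    return pairs
```

For example, a caption like `"Histology overview. (a) tumor region. (b) normal tissue."` with two panel boxes yields one figure-level pair plus two panel-level pairs, which is the sense in which one figure becomes several supervision samples.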