🤖 AI Summary
Existing biomedical vision-language pretraining primarily relies on coarse-grained image-text pairing, neglecting fine-grained, region-level structural alignment essential for clinical interpretation.
Method: We propose Panel2Patch, a novel data pipeline that decomposes multi-panel, annotated biomedical figures into three hierarchical supervision granularities (figure, panel, and patch), explicitly modeling cross-level image-text correspondences. Our approach integrates layout parsing with visual marker recognition and introduces a hierarchical image-text alignment strategy incorporating cross-granularity contrastive learning and joint optimization.
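The summary does not specify the training objective in detail; a minimal sketch of what cross-granularity contrastive learning with joint optimization could look like, assuming one batch of matched image/text embeddings per granularity and hypothetical per-level weights (the function names and weights here are illustrative, not from the paper):

```python
import numpy as np

def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    img, txt: (B, D) arrays where row i of img matches row i of txt.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); diagonal = positive pairs

    def nll(l):
        # Numerically stable log-softmax along each row, then take the
        # negative log-probability of the diagonal (matched) entries.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (nll(logits) + nll(logits.T))

def hierarchical_loss(pairs, weights=None):
    """Joint objective: weighted sum of per-granularity contrastive losses.

    pairs: {"figure": (img, txt), "panel": (...), "patch": (...)}.
    The weights below are placeholders, not values from the paper.
    """
    if weights is None:
        weights = {"figure": 1.0, "panel": 1.0, "patch": 0.5}
    return sum(w * info_nce(*pairs[g]) for g, w in weights.items() if g in pairs)
```

Under this reading, "joint optimization" simply means the figure-, panel-, and patch-level losses share one backward pass, so coarse and fine supervision shape the same encoder.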
Contribution/Results: Panel2Patch generates high-information-density supervision signals from only a small set of literature figures, substantially reducing pretraining data requirements. It significantly improves model performance on fine-grained understanding, region localization, and downstream tasks, including figure question answering and lesion description generation, establishing a new paradigm for multi-granularity biomedical vision-language modeling.
📄 Abstract
There is growing interest in developing strong biomedical vision-language models. A popular approach to achieving robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts it into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
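The abstract describes turning one figure-caption pair into figure-, panel-, and patch-level pairs. A toy sketch of the coarser two levels, assuming a layout parser has already produced per-panel bounding boxes and that sub-captions are keyed by "(a)"/"(b)" markers (all names and the marker convention are assumptions for illustration; real captions and patch-level marker detection are considerably messier):

```python
import re

def split_caption(caption):
    """Split a multi-panel caption on (a)/(b)/... markers into sub-captions.

    Returns the figure-level preamble and a {label: sub-caption} dict.
    """
    parts = re.split(r"\(([a-z])\)\s*", caption)
    figure_text = parts[0].strip()
    subs = {
        parts[i]: parts[i + 1].strip().rstrip(".,;")
        for i in range(1, len(parts) - 1, 2)
    }
    return figure_text, subs

def build_pairs(figure_id, panel_boxes, caption):
    """Construct figure- and panel-level image-text pairs.

    panel_boxes: {label: (x, y, w, h)} assumed to come from a layout parser.
    Patch-level pairs would additionally need visual-marker detection
    (arrows, asterisks) inside each panel; they are omitted here.
    """
    figure_text, subs = split_caption(caption)
    pairs = [("figure", figure_id, caption)]  # whole figure + full caption
    for label, box in panel_boxes.items():
        if label in subs:  # crop region paired with its own sub-caption
            pairs.append(("panel", (figure_id, box), subs[label]))
    return pairs
```

For example, a caption like `"Histology overview. (a) tumor region. (b) normal tissue."` with two panel boxes yields one figure-level pair plus two panel-level pairs, which is the sense in which one figure becomes several supervision samples.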