No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

📅 2025-10-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Biomedical vision-language models (VLMs) are constrained by short text encoders (typically capped at 77 tokens), which truncate lengthy biomedical literature descriptions and discard substantial semantic information. To address this, the authors introduce a long-context modeling framework: (1) BIOMEDICA-LongCAP, a million-scale biomedical image-caption dataset with extended textual descriptions; (2) BMC-LongCLIP, a VLM featuring a 512-token text encoder while retaining a lightweight image encoder; and (3) contrastive pretraining on this data. The key contribution is the first systematic extension of biomedical VLM text context to 512 tokens, improving cross-modal semantic alignment without compromising image-encoder efficiency. Experiments show significant gains: up to +30% Recall@1 in long-text retrieval, +2% average classification accuracy, faster training convergence, and a reduction in token waste from 55% to 2.2%.
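The "token waste" figure above can be made concrete with a small sketch (not the paper's code): waste is the fraction of caption tokens that fall beyond the encoder's context window and are therefore discarded at truncation. Whitespace splitting stands in for the model's real tokenizer here, purely for illustration.

```python
# Sketch: estimate "token waste" -- the fraction of caption tokens
# discarded when captions are truncated to a fixed context window.
# Whitespace splitting is a stand-in for the model's actual tokenizer.

def token_waste(captions, window):
    """Fraction of total tokens that fall beyond the context window."""
    total = 0
    wasted = 0
    for cap in captions:
        n = len(cap.split())  # stand-in for a real tokenizer
        total += n
        wasted += max(0, n - window)
    return wasted / total if total else 0.0

# Toy corpus: one short figure caption, one long literature-style caption.
captions = [
    "short figure caption",
    " ".join(["tok"] * 200),
]
print(f"waste@77:  {token_waste(captions, 77):.1%}")   # most long-caption tokens lost
print(f"waste@512: {token_waste(captions, 512):.1%}")  # nothing truncated
```

On the real BIOMEDICA-LongCAP distribution, the paper reports this quantity dropping from 55% at a 77-token window to 2.2% at 512 tokens.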

📝 Abstract
Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open-source literature reveals that a substantial portion of captions far exceed 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus enabling the additional supervision provided in long-format captions) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context baselines. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
Problem

Research questions and friction points this paper is trying to address.

Extending text encoder context length for biomedical VLMs
Reducing token waste from long-format biomedical captions
Improving retrieval and classification with longer context supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extending text encoder context to 512 tokens
Using long-format biomedical captions for pretraining
Creating context-aware image-caption dataset BIOMEDICA-LongCAP
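One common way to extend a CLIP-style text encoder's context window (e.g., 77 to 512 tokens) is to interpolate its learned positional embeddings to the new length before continued pretraining. The paper does not specify its exact initialization scheme, so the sketch below illustrates the general technique only; the function name and linear interpolation are assumptions, not the authors' method.

```python
import numpy as np

def interpolate_pos_emb(pos_emb, new_len):
    """Linearly interpolate learned positional embeddings (old_len, dim)
    to a longer sequence length (new_len, dim), per embedding dimension."""
    old_len, dim = pos_emb.shape
    old_x = np.linspace(0.0, 1.0, old_len)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_x, old_x, pos_emb[:, d]) for d in range(dim)],
        axis=1,
    )

# Toy example: stretch a 77-position table to 512 positions.
old = np.random.randn(77, 512).astype(np.float32)  # (context, width)
new = interpolate_pos_emb(old, 512)
print(new.shape)  # (512, 512)
```

The interpolated table keeps the first and last positions unchanged and spaces the rest smoothly in between, giving the longer-context model a warm start instead of randomly initialized positions.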