LongProLIP: A Probabilistic Vision-Language Model with Long Context Text

📅 2025-03-11
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing Probabilistic Language-Image Pre-Training (ProLIP) models are constrained by a 64-token text input limit, hindering effective long-context semantic modeling. Method: We propose the first long-context extension of ProLIP, scaling its text encoder capacity to 256 tokens while preserving zero-shot transferability. Our approach introduces a lightweight fine-tuning strategy that jointly optimizes the extended text encoder and the cross-modal probabilistic alignment mechanism, ensuring computational efficiency and modality consistency. Results: On the Urban-1k long-text visual understanding benchmark, our method achieves significant performance gains. Zero-shot evaluation on DataComp shows only marginal accuracy degradation (<0.5% on average), confirming a robust trade-off between long-context expressivity and generalization fidelity. This work establishes the first scalable, high-fidelity adaptation framework for ProLIP in long-text–driven vision-language understanding.
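The summary above does not spell out how a 64-token text encoder is stretched to 256 tokens before fine-tuning. A common approach for CLIP-style models (not necessarily the exact LongProLIP recipe) is to interpolate the learned positional embeddings to the new length and then fine-tune. The sketch below illustrates that idea; the function name, embedding dimension, and use of plain linear interpolation are assumptions for illustration only.

```python
import numpy as np

def extend_positional_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Linearly interpolate a learned positional-embedding table of shape
    (old_len, dim) to (new_len, dim), so a longer context window can be
    fine-tuned starting from the pretrained positions."""
    old_len, dim = pos_emb.shape
    # Map both the old and new position indices onto [0, 1] so they align.
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    out = np.empty((new_len, dim), dtype=pos_emb.dtype)
    for d in range(dim):
        # Interpolate each embedding dimension independently.
        out[:, d] = np.interp(new_pos, old_pos, pos_emb[:, d])
    return out

# Hypothetical sizes matching the paper's setting: 64 -> 256 token positions.
emb64 = np.random.default_rng(0).standard_normal((64, 512)).astype(np.float32)
emb256 = extend_positional_embeddings(emb64, 256)
```

The interpolation keeps the first and last pretrained positions fixed and fills the new positions in between, which preserves the pretrained geometry as a starting point for the subsequent fine-tuning stage.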

πŸ“ Abstract
Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning. We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by ImageNet or the average of 38 zero-shot evaluation datasets by DataComp).
Problem

Research questions and friction points this paper is trying to address.

ProLIP's 64-token limit prevents modeling of long-context texts
Rich contextual information in longer text sequences is lost under that limit
Fine-tuning for long context risks eroding general zero-shot capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

A lightweight fine-tuning recipe (LongProLIP) that adapts ProLIP to longer texts
Context length extended from 64 to 256 text tokens
Explicit management of the trade-off between long-context understanding and zero-shot capability
🔎 Similar Papers
No similar papers found.