🤖 AI Summary
Existing Probabilistic Language-Image Pre-Training (ProLIP) models are constrained by a 64-token text input limit, hindering effective long-context semantic modeling. Method: We propose the first long-context extension of ProLIP, scaling its text encoder capacity to 256 tokens while preserving zero-shot transferability. Our approach introduces a lightweight fine-tuning strategy that jointly optimizes the extended text encoder and the cross-modal probabilistic alignment mechanism, ensuring computational efficiency and modality consistency. Results: On the Urban-1k long-text visual understanding benchmark, our method achieves significant performance gains. Zero-shot evaluation on DataComp shows only marginal accuracy degradation (<0.5% on average), confirming a robust trade-off between long-context expressivity and generalization fidelity. This work establishes the first scalable, high-fidelity adaptation framework for ProLIP in long-text-driven vision-language understanding.
📝 Abstract
Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite its success in probabilistic representation learning at scale, ProLIP cannot handle texts longer than 64 tokens, which limits its ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy that enables ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe improves long-context understanding while minimizing the negative effect of fine-tuning. We also observe a trade-off between long-context understanding (measured by Urban-1k) and general zero-shot capability (measured by ImageNet or the average of the 38 zero-shot evaluation datasets of DataComp).
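The abstract does not spell out how the text encoder's context is extended before fine-tuning. A common way to initialize a longer context window (used by Long-CLIP-style approaches, and assumed here rather than taken from the paper) is to interpolate the pretrained positional-embedding table from 64 to 256 positions so fine-tuning starts from a sensible point. A minimal sketch, assuming learned absolute positional embeddings:

```python
import numpy as np

def extend_pos_embed(pos_embed: np.ndarray, new_len: int = 256) -> np.ndarray:
    """Linearly interpolate a learned positional-embedding table
    (old_len, dim) -> (new_len, dim), e.g. 64 -> 256 positions.

    This is an illustrative initialization, not the paper's exact recipe.
    """
    old_len, dim = pos_embed.shape
    # Map old and new positions onto the same [0, 1] axis.
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    # Interpolate each embedding dimension independently.
    return np.stack(
        [np.interp(new_pos, old_pos, pos_embed[:, d]) for d in range(dim)],
        axis=1,
    )

# Example: a hypothetical 64-position, 512-dim table extended to 256 positions.
pe = np.random.randn(64, 512).astype(np.float32)
pe_long = extend_pos_embed(pe, 256)
print(pe_long.shape)  # (256, 512)
```

Because the endpoints of the old and new axes coincide, the first and last original embeddings are preserved exactly; the extended table is then refined jointly with the alignment mechanism during fine-tuning.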