🤖 AI Summary
To address the low data and computational efficiency of vision-language model (VLM) pretraining, this paper proposes a frequency-based dynamic text masking strategy. The authors observe that the optimal masking pattern evolves across training epochs, with word frequency serving as the key regulatory factor. Building on this insight, they introduce CLIPF, a framework that combines part-of-speech-aware word frequency calibration with dynamic masking scheduling, enabling efficient pretraining on small-scale datasets and low-token inputs. CLIPF consistently outperforms baseline masking schemes (truncation, random, block, and syntax masking) across multiple benchmarks; notably, under token-reduction scenarios it achieves average downstream accuracy gains of 2.3%–4.1%. This work is the first to empirically uncover and exploit the dynamic role of word frequency in efficient VLM pretraining, improving both training efficiency and generalization.
📝 Abstract
Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of approaches: truncation, random masking, block masking and syntax masking. In this paper, we show that the best masking strategy changes over training epochs and that, given sufficient training epochs, word frequency information is what you need to achieve the best performance. Experiments on a wide range of datasets demonstrate the advantages of our approach, called Contrastive Language-Image Pre-training with word Frequency Masking (CLIPF). The benefits are particularly evident as the number of input tokens decreases. We analyze the impact of CLIPF vs. other masking approaches on word frequency balance and discuss the apparently critical contribution of CLIPF in maintaining word frequency balance across POS categories.
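To make the core idea concrete, here is a minimal sketch of frequency-based caption masking: when a caption must be shortened to a token budget, preferentially mask high-frequency words so that rarer, more informative words survive. This is an illustrative reconstruction, not the paper's implementation; the function name `frequency_mask`, the `[MASK]` placeholder, the `keep_ratio` parameter, and the tie-breaking by position are all assumptions, and CLIPF's actual POS-aware calibration and epoch-dependent scheduling are not modeled here.

```python
from collections import Counter

def frequency_mask(tokens, freq, keep_ratio=0.5):
    """Keep the keep_ratio least-frequent tokens of a caption and mask the rest.

    tokens     -- tokenized caption (list of words)
    freq       -- precomputed corpus word counts (word -> count)
    keep_ratio -- fraction of tokens to retain (a schedule could vary
                  this across epochs, echoing the paper's observation
                  that the best masking pattern changes during training)
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token positions by corpus frequency, rarest first;
    # Python's stable sort breaks ties by original position.
    order = sorted(range(len(tokens)), key=lambda i: freq.get(tokens[i], 0))
    keep = set(order[:n_keep])
    return [t if i in keep else "[MASK]" for i, t in enumerate(tokens)]

# Toy corpus frequencies (illustrative only).
corpus = "a dog runs and a cat sleeps and a dog barks".split()
freq = Counter(corpus)

caption = "a dog chases the red ball".split()
print(frequency_mask(caption, freq, keep_ratio=0.5))
# → ['[MASK]', '[MASK]', 'chases', 'the', 'red', '[MASK]']
```

Note how the frequent function word "a" and the common noun "dog" are masked while rarer content words are retained; a real pipeline would drop masked tokens entirely to shrink the input sequence rather than substitute a placeholder.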