🤖 AI Summary
Conventional wisdom holds that pre-trained language models struggle to transfer directly to downstream visual tasks due to significant discrepancies in parameter space and task objectives. This work proposes a novel, annotation-free random-label bridge training strategy that aligns the parameters of large language models with visual tasks while selectively freezing specific network layers, thereby enabling effective cross-modal adaptation. The study reveals that certain layers within language models inherently possess strong visual generalization capabilities, allowing them to perform basic visual tasks without fine-tuning. By successfully applying purely language-pretrained models to general vision tasks, this approach demonstrates the feasibility of unsupervised cross-modal bridging and establishes a new paradigm for transferring knowledge across modalities.
📝 Abstract
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality transfer (language to vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge the language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to their disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random-label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.