Better Language Models Exhibit Higher Visual Alignment

📅 2024-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether purely text-based large language models (LLMs) inherently possess visual alignment capabilities. Method: We propose a zero-shot generalization framework that freezes the text encoder and integrates it into a discriminative vision-language architecture, requiring neither fine-tuning of the LLM nor paired vision-language data. Contribution/Results: We empirically demonstrate, for the first time, that decoder-only LLMs intrinsically encode strong visual-semantic alignment, and that this capability scales positively with model size. The approach also supports cross-lingual zero-shot transfer: on Chinese zero-shot image classification it achieves 38.7% accuracy, far surpassing CLIP's 1.4%. Critically, it substantially reduces dependence on labeled data and computational resources while enhancing model robustness, generalization, and accessibility.

📝 Abstract
How well do text-only Large Language Models (LLMs) naturally align with the visual world? We provide the first direct analysis by utilizing frozen text representations in a discriminative vision-language model framework and measuring zero-shot generalization on unseen classes. We find decoder-based LLMs exhibit high intrinsic visual alignment. In particular, more capable LLMs reliably demonstrate stronger generalization. Moreover, utilizing frozen LLMs leads to strong gains in cross-lingual settings, where our approach reaches 38.7% accuracy for Chinese, surpassing CLIP's 1.4%. Our proposed method improves both robustness and generalization and also significantly reduces the need for paired data and compute, making vision-language models more accessible and adaptable.
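The core evaluation idea, classifying an image by matching its embedding against frozen text embeddings of class names, can be sketched as below. This is a minimal illustration, not the authors' implementation: the toy 4-d vectors stand in for real frozen-LLM text features and trained vision-encoder outputs, and the function names are hypothetical.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Pick the class whose (frozen) text embedding has the highest
    cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Toy example with hypothetical embeddings (placeholders for real features).
class_names = ["cat", "dog", "car"]
class_embs = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.1],
])
image_emb = np.array([0.8, 0.2, 0.1, 0.1])  # closest to the "cat" embedding
print(zero_shot_classify(image_emb, class_embs, class_names))  # → cat
```

Because the text side stays frozen, adding a new class (or a new language, as in the Chinese experiments) only requires embedding new class names, with no retraining of the language model.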
Problem

Research questions and friction points this paper is trying to address.

Assessing text-only LLMs' visual alignment
Evaluating zero-shot generalization in vision-language models
Enhancing cross-lingual accuracy and reducing data needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes frozen text representations
Measures zero-shot generalization
Reduces need for paired data
Jona Ruthardt
FunAI Lab, University of Technology Nuremberg
G. Burghouts
Intelligent Imaging, TNO
Serge J. Belongie
Department of Computer Science, University of Copenhagen
Yuki M. Asano
Full Professor, Head of FunAI Lab, University of Technology Nuremberg
Deep Learning · Multimodal Learning · Self-supervised Learning · Large Model Adaptation · LLMs