🤖 AI Summary
This study aims to identify the "semantic core layers" of large language models (LLMs) and vision Transformers (ViTs), i.e., the layers encoding the richest semantic information, and to quantify the information content and directional asymmetry of cross-modal representations.
Method: We propose a quantitative analytical framework grounded in the information bottleneck principle and inter-layer mutual information, integrated with translation-aligned modeling, caption–image prediction, and cross-modal similarity probing.
Contribution/Results: For the first time, we systematically identify semantic-critical layers in LLMs (DeepSeek-V3, Llama3.1-8B) and ViTs. Key findings include: (i) semantic information exhibits long-range token dependencies and causal asymmetry across layers; (ii) semantic layers in LLMs generalize to predict ViT image representations; and (iii) cross-modal semantic information flow displays strong, model-dependent unidirectional dominance (e.g., from text to image or vice versa) rather than symmetry. These results provide theoretical foundations and a novel interpretability pathway for multimodal representation learning.
📝 Abstract
Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. Moreover, we find that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-to-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
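The layer-wise probing of translated sentence pairs can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function name, the use of mean-pooled per-layer hidden states, and the synthetic stand-in data are all assumptions, and the paper's framework is information-theoretic (information bottleneck, inter-layer mutual information) rather than cosine-based. The sketch only shows the shape of the analysis: score each layer by how strongly it aligns paired inputs, then take the best-scoring layer as the "semantic" one.

```python
import numpy as np

def layerwise_pair_similarity(reps_a, reps_b):
    """Mean cosine similarity between paired representations at each layer.

    reps_a, reps_b: arrays of shape [n_layers, n_pairs, dim], e.g. mean-pooled
    hidden states of a sentence (reps_a) and its translation (reps_b).
    Returns an array of shape [n_layers], one alignment score per layer.
    """
    a = reps_a / np.linalg.norm(reps_a, axis=-1, keepdims=True)
    b = reps_b / np.linalg.norm(reps_b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1).mean(axis=-1)

# Synthetic stand-in for real hidden states: by construction, only layer 3
# carries language-transferable signal (paired representations are strongly
# correlated there and independent everywhere else).
rng = np.random.default_rng(0)
n_layers, n_pairs, dim = 8, 64, 32
reps_a = rng.standard_normal((n_layers, n_pairs, dim))
reps_b = rng.standard_normal((n_layers, n_pairs, dim))
reps_b[3] = reps_a[3] + 0.1 * rng.standard_normal((n_pairs, dim))

sims = layerwise_pair_similarity(reps_a, reps_b)
semantic_layer = int(np.argmax(sims))  # layer with the most shared signal
print(semantic_layer)
```

With real models, `reps_a` and `reps_b` would come from the hidden states of an LLM run on each side of a translation pair; the same scoring applies unchanged to caption/image pairs when probing cross-modal alignment.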