🤖 AI Summary
This study aims to identify the "semantic core layers" of large language models (LLMs) and vision Transformers (ViTs), i.e., the layers encoding the richest semantic information, and to quantify the information content and directional asymmetry of cross-modal representations.
Method: We propose a quantitative analytical framework grounded in the information bottleneck principle and inter-layer mutual information, integrated with translation-aligned modeling, caption–image prediction, and cross-modal similarity probing.
Contribution/Results: For the first time, we systematically identify semantic-critical layers in LLMs (DeepSeek-V3, Llama3.1-8B) and ViTs. Key findings include: (i) semantic information exhibits long-range token dependencies and causal asymmetry across layers; (ii) semantic layers in LLMs generalize to predict ViT image representations; and (iii) cross-modal semantic information flow displays strong, model-dependent unidirectional dominance (e.g., from text to image or vice versa) rather than symmetry. These results provide theoretical foundations and a novel interpretability pathway for multimodal representation learning.
📝 Abstract
Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. Moreover, we find that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-to-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
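The layer-wise probing of translated sentence pairs can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function name, the use of mean-pooled per-layer hidden states, and the synthetic stand-in data are all assumptions, and the paper's framework is information-theoretic (information bottleneck, inter-layer mutual information) rather than cosine-based. The sketch only shows the shape of the analysis: score each layer by how strongly it aligns paired inputs, then take the best-scoring layer as the "semantic" one.

```python
import numpy as np

def layerwise_pair_similarity(reps_a, reps_b):
    """Mean cosine similarity between paired representations at each layer.

    reps_a, reps_b: arrays of shape [n_layers, n_pairs, dim], e.g. mean-pooled
    hidden states of a sentence (reps_a) and its translation (reps_b).
    Returns an array of shape [n_layers], one alignment score per layer.
    """
    a = reps_a / np.linalg.norm(reps_a, axis=-1, keepdims=True)
    b = reps_b / np.linalg.norm(reps_b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1).mean(axis=-1)

# Synthetic stand-in for real hidden states: by construction, only layer 3
# carries language-transferable signal (paired representations are strongly
# correlated there and independent everywhere else).
rng = np.random.default_rng(0)
n_layers, n_pairs, dim = 8, 64, 32
reps_a = rng.standard_normal((n_layers, n_pairs, dim))
reps_b = rng.standard_normal((n_layers, n_pairs, dim))
reps_b[3] = reps_a[3] + 0.1 * rng.standard_normal((n_pairs, dim))

sims = layerwise_pair_similarity(reps_a, reps_b)
semantic_layer = int(np.argmax(sims))  # layer with the most shared signal
print(semantic_layer)
```

With real models, `reps_a` and `reps_b` would come from the hidden states of an LLM run on each side of a translation pair; the same scoring applies unchanged to caption/image pairs when probing cross-modal alignment.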