Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

📅 2026-03-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large vision-language models (LVLMs) struggle to interpret abstract relational structure in diagrams, such as the direction of edges. To investigate this limitation, the work constructs a synthetic dataset of directed-graph diagrams and uses linear probing alongside cross-modal representation analysis to examine where node and edge information is encoded within the model. The probes show that node information and global structural features are already linearly encoded in individual hidden states of the vision encoder, whereas edge information becomes linearly separable only in the text-token representations of the language model. The authors suggest that this delayed emergence of edge representations may help explain the weakness in relational understanding, and that it points to a functional division of labor between the visual and linguistic modules in modeling structured relationships.

📝 Abstract

Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle to understand relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representations of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which requires more abstract, compositionally integrated processing.
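To make "linearly separable" concrete, here is a minimal sketch of a linear probe in Python. It is not the authors' setup: the hidden states are random stand-ins, the two label types are hypothetical analogues (one readable by a single linear boundary, one XOR-like and therefore not), and the names `train_linear_probe` and `probe_accuracy` are illustrative only.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        err = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the linear boundary."""
    pred = (X @ w + b > 0).astype(int)
    return float((pred == y).mean())


# Assumption: random vectors standing in for hidden states of one layer.
rng = np.random.default_rng(0)
n, d = 2000, 16
H = rng.normal(size=(n, d))

labels = {
    # Readable with one hyperplane (analogue of a linearly encoded property).
    "linear": (H[:, 0] > 0).astype(int),
    # XOR of two directions: present in H, but not linearly separable.
    "xor": ((H[:, 0] > 0) ^ (H[:, 1] > 0)).astype(int),
}

split = n // 2  # first half trains the probe, second half evaluates it
results = {}
for name, y in labels.items():
    w, b = train_linear_probe(H[:split], y[:split])
    results[name] = probe_accuracy(H[split:], y[split:], w, b)
    print(f"{name}: held-out probe accuracy = {results[name]:.2f}")
```

In this toy setting the probe should score well on the first label and stay near chance on the second, even though both are fully determined by the same vectors. That gap is the kind of evidence a probing study relies on: low probe accuracy means the information is not linearly decodable at that layer, not necessarily that it is absent.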

Problem

Research questions and friction points this paper is trying to address.

diagram understanding

relational reasoning

vision-language models

directed edges

node-edge representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

diagram understanding

representation probing