🤖 AI Summary
Large vision-language models struggle to comprehend abstract relational structures in diagrams, such as the directionality of edges. To investigate this limitation, this work constructs a synthetic directed-graph dataset and uses linear probing together with cross-modal representation analysis to examine how node and edge information is encoded within the model. The study reports, for the first time, that node representations are already linearly separable at the output of the visual encoder, whereas edge directionality becomes linearly separable only in the text-token representations produced by the language model. This finding suggests that the bottleneck in relational understanding stems from the delayed emergence of edge representations, and it points to a functional division of labor between the visual and linguistic modules in modeling structured relationships.
📝 Abstract
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle to understand relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representations of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which requires more abstract, compositionally integrated processing.
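To make the notion of "linearly separable" concrete, the following is a minimal sketch of a linear probe, not the authors' code. It trains a logistic-regression classifier on synthetic 4-dimensional vectors that stand in for hidden states. The function names, the data, and the two label rules are all assumptions made for illustration: one label depends on the sign of a single feature (linearly encoded), while the other is an XOR of two feature signs (present in the features, but not linearly encoded). The XOR rule is only a toy stand-in and makes no claim about how edge direction is actually represented in an LVLM.

```python
import math
import random


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - target
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples on the correct side of the learned hyperplane."""
    hits = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)


# Hypothetical stand-in for hidden states: random 4-d vectors (assumption).
rng = random.Random(0)


def sample(n):
    return [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]


def linear_label(x):
    # Linearly encoded attribute: depends on the sign of one feature.
    return int(x[0] > 0)


def xor_label(x):
    # Attribute that is present but not linearly encoded (toy XOR rule).
    return int((x[0] > 0) != (x[1] > 0))


train_x, test_x = sample(400), sample(500)


def run_probe(label_fn):
    w, b = train_linear_probe(train_x, [label_fn(x) for x in train_x])
    return probe_accuracy(w, b, test_x, [label_fn(x) for x in test_x])


acc_linear = run_probe(linear_label)
acc_xor = run_probe(xor_label)
print("probe accuracy, linearly encoded label:", acc_linear)
print("probe accuracy, XOR-encoded label:", acc_xor)
```

In this toy setting the probe should score far higher on the first label than on the second. In a probing study of this kind, the same comparison would be run on hidden states taken from different stages of the model, and a rise in probe accuracy at a later stage is what indicates that an attribute has become linearly encoded there.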