🤖 AI Summary
This work investigates why independently trained multimodal neural representations converge toward a shared structure, with particular emphasis on the direction of that convergence. Using cycle-kNN, an asymmetric alignment metric, together with feature density analysis and Information Bottleneck theory, the study examines representational relationships across dozens of independently trained unimodal models spanning point clouds, vision, and language. It reports that non-linguistic modalities consistently align toward the neighborhood structure of linguistic representations, a pattern observed across the model families and scales studied. Language representations are found to be the most compact in embedding space, and the Information Bottleneck framework is used to interpret this: optimization under compression drives representations toward discrete, compositional structures characteristic of language. The authors formalize the result as the “Wittgensteinian Representation Hypothesis”: the semantic structure of language acts as the asymptotic attractor of multimodal representation convergence.
📝 Abstract
Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales, yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.
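The abstract does not define cycle-kNN, so the following is only a minimal sketch of one plausible reading, offered to illustrate how a neighborhood-based alignment score can depend on direction. Everything in it is an assumption rather than the authors' method: the function names `knn_indices` and `cycle_knn`, the use of Euclidean distance, and the cycle rule itself (hop from item i to one of its k nearest neighbors in the source space, then check whether i is among that neighbor's k nearest neighbors in the target space). Rows of the two feature matrices are assumed to be paired, so that row i describes the same item in both modalities.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) array of each row's k nearest neighbours (Euclidean), excluding itself."""
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour
    return np.argsort(dists, axis=1, kind="stable")[:, :k]


def cycle_knn(source: np.ndarray, target: np.ndarray, k: int) -> float:
    """Hypothetical directional score in [0, 1] (an assumption, not the paper's definition).

    Item i counts as a completed cycle if some neighbour j of i in the source
    space has i among its own k nearest neighbours in the target space.
    Swapping `source` and `target` can change the score when k >= 2.
    """
    n = source.shape[0]
    nn_src = knn_indices(source, k)  # (n, k) neighbours in the source space
    nn_tgt = knn_indices(target, k)  # (n, k) neighbours in the target space
    # hops[i, a, b] = b-th target-space neighbour of the a-th source-space neighbour of i
    hops = nn_tgt[nn_src]  # shape (n, k, k)
    returned = (hops == np.arange(n)[:, None, None]).any(axis=(1, 2))
    return float(returned.mean())


if __name__ == "__main__":
    # Synthetic stand-ins for paired features; real use would load model embeddings.
    rng = np.random.default_rng(0)
    vision = rng.normal(size=(200, 32))
    language = rng.normal(size=(200, 16))
    print("vision -> language:", cycle_knn(vision, language, k=10))
    print("language -> vision:", cycle_knn(language, vision, k=10))
```

Under this particular rule the two directions always coincide at k = 1, because each completed cycle in one direction maps to exactly one completed cycle in the other; any asymmetry only shows up for k of 2 or more. A symmetric measure such as mutual kNN overlap returns a single number for the pair of spaces and therefore cannot express which side is moving toward the other.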