🤖 AI Summary
This study investigates whether convergence in the internal representations of language models implies alignment in their reasoning processes. The authors analyze 16 models from 8 families (1.5B to 72B parameters) on 800 questions spanning mathematics, science, commonsense, and truthfulness. The analysis is stratified by question difficulty, computational stage, and causal relevance, and uses centered kernel alignment (CKA), cross-model decoding accuracy, and causal ablation experiments. It uncovers three dissociations between representational convergence and reasoning: a difficulty inversion, a generation gap, and epiphenomenal correctness. Models are more similar on questions they collectively fail (CKA = 0.897) than on those they solve (CKA = 0.830), and pre-decision representations align (CKA = 0.875) while post-decision representations diverge sharply (CKA = 0.274). Although shared information is decodable across models (66% transfer accuracy), it exerts minimal causal influence on predictions, with ablation-induced flip rates of only 1.5% to 5.5%. These findings challenge the assumption that similar representations entail consistent reasoning.
📝 Abstract
Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
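For readers unfamiliar with the similarity metric quoted above, the following is a minimal sketch of linear CKA between two activation matrices computed on the same set of problems. It assumes the linear-kernel variant of CKA and NumPy; the abstract does not state which kernel, layers, or pooling the authors use, and the function name `linear_cka` is hypothetical rather than taken from the released code.

```python
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    X has shape (n_examples, d1) and Y has shape (n_examples, d2).
    Rows must correspond to the same inputs; the feature widths may differ,
    which is what allows comparing models of different sizes.
    Assumption: linear kernel (the source does not specify the variant).
    """
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of rows")

    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # Cross-covariance energy, normalized by each side's self-covariance energy.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for two models' activations on 100 shared problems.
    acts_a = rng.normal(size=(100, 16))
    acts_b = rng.normal(size=(100, 32))
    print("CKA(a, a):", linear_cka(acts_a, acts_a))
    print("CKA(a, b):", linear_cka(acts_a, acts_b))
```

The score lies between 0 and 1 and is unchanged by rotating or uniformly rescaling either representation, so a high value indicates shared geometry but says nothing about whether that geometry is causally used. That gap is what the decoding and ablation results above are meant to probe.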