Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether language models with heterogeneous architectures develop alignable latent representational geometries during reasoning, and whether that alignment can be exploited to modify behavior without weight updates. A linear projection from a teacher model's activations to a student model's residual stream is learned via ridge regression, and the translated teacher state is substituted for the student's activations during generation, enabling cross-architectural interventions. The study finds that reasoning domains such as language and mathematics occupy domain-specific latent subspaces whose projections do not transfer across domains, a property observed consistently across 20 diverse model pairs. Experiments show average behavior correction rates of 25.2% on TruthfulQA and 25.5% on GSM8K, while cross-domain projection collapses to a mean R² of −3.83, corroborating the pronounced domain specificity of latent geometric structure.

📝 Abstract
We investigate whether independently trained language models converge to geometrically compatible latent representations, and whether this compatibility can be exploited to correct model behavior at inference time without any weight updates. We learn a linear projection matrix that maps activation vectors from a large teacher model into the coordinate system of a smaller student model, then intervene on the student's residual stream during generation by substituting its internal state with the translated teacher representation. Across a fully crossed experimental matrix of 20 heterogeneous teacher-student pairings spanning mixture-of-experts, dense, code-specialized, and synthetically trained architectures, the Ridge projection consistently achieves R² = 0.50 on verbal reasoning and R² = 0.40 on mathematical reasoning, collapsing to R² = −0.22 under permutation control and R² = 0.01 under L1 regularization. Behavioral correction rates range from 14.0% to 50.0% on TruthfulQA (mean 25.2%) and from 8.5% to 43.3% on GSM8K arithmetic reasoning (mean 25.5%), demonstrating that the method generalizes across fundamentally different reasoning domains. We report a near-zero correlation between geometric alignment quality and behavioral correction rate (r = −0.07), revealing a dissociation between representation space fidelity and output space impact. Intervention strength is architecture-specific: student models exhibit characteristic sensitivity profiles that invert across domains, with the most steerable verbal student becoming the least steerable mathematical student. Finally, a double dissociation experiment conducted across all 20 model pairings confirms without exception that projection matrices collapse catastrophically when transferred across reasoning domains (mean R² = −3.83 in both transfer directions), establishing domain-specific subspace geometry as a universal property of language models.
Problem

Research questions and friction points this paper is trying to address.

latent geometry
cross-architecture translation
domain-specific representation
behavioral correction
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-specific geometry
cross-architecture alignment
latent space projection
behavioral steering
representation translation