🤖 AI Summary
This study investigates the feasibility of “System 1–System 2”-style dual-architecture latent reasoning in large language models: specifically, whether reasoning should be decoupled into a base model and a coprocessor that coordinate by exchanging latent representations. The authors propose a dual-model latent communication framework and empirically test two hypotheses: (H1) increasing latent channel capacity improves performance, and (H2) joint fine-tuning enables the modules to learn to communicate. Experiments are conducted on GPT-2 and Qwen-3 across GSM8K, ProsQA, and Countdown. Results show that joint fine-tuning significantly outperforms communication between independently trained modules, achieving the best performance under a fixed latent-token budget; a unified soft-embedding model nearly matches it; and scaling latent dimensionality fails to improve robustness on complex reasoning, pointing to semantic overlap in the latent space as the key bottleneck. The core contribution is the first empirical demonstration that latent communication efficacy is fundamentally constrained by representation alignment, not bandwidth, and the establishment of joint fine-tuning as the current strongest paradigm.
📝 Abstract
Should LLM reasoning live in a separate module, or within a single model's forward pass and representational space? We study dual-architecture latent reasoning, in which a fluent Base exchanges latent messages with a Coprocessor, and test two hypotheses for improving latent communication over Liu et al. (2024): (H1) increase channel capacity; (H2) learn communication via joint finetuning. Under matched latent-token budgets on GPT-2 and Qwen-3, H2 is consistently strongest, while H1 yields only modest gains. A unified soft-embedding baseline (a single model with the same forward pass and shared representations, under the same latent-token budget) nearly matches H2 and surpasses H1, suggesting that current dual designs mostly add compute rather than qualitatively improving reasoning. Across GSM8K, ProsQA, and a Countdown stress test with increasing branching factor, scaling the latent-token budget beyond small values fails to improve robustness. Latent analyses show overlapping subspaces with limited specialization, consistent with the weak reasoning gains. We conclude that dual-model latent reasoning remains promising in principle but likely requires objectives and communication mechanisms that explicitly shape latent spaces for algorithmic planning.
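The dual-model setup described above can be sketched abstractly: the Base exposes K hidden states (the latent-token budget), the Coprocessor transforms them in its own representation space, and the result is written back as soft tokens in the Base's embedding space. Below is a minimal numerical sketch with toy linear maps standing in for both transformer models; all names, dimensions, and the tanh nonlinearity are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D_BASE, D_COPRO, K = 16, 8, 4  # hidden sizes and latent-token budget K (illustrative)

# Hypothetical read/write projections between the two modules' spaces.
W_read = rng.normal(size=(D_BASE, D_COPRO)) / np.sqrt(D_BASE)    # Base -> Coprocessor
W_write = rng.normal(size=(D_COPRO, D_BASE)) / np.sqrt(D_COPRO)  # Coprocessor -> Base

def coprocessor(latents: np.ndarray) -> np.ndarray:
    """Map K Base hidden states to K soft tokens in the Base's embedding space."""
    h = np.tanh(latents @ W_read)  # transform in the Coprocessor's space
    return h @ W_write             # write back as soft embeddings for the Base

# One round of latent communication under a budget of K latent tokens:
base_hidden = rng.normal(size=(K, D_BASE))  # hidden states the Base exposes
soft_tokens = coprocessor(base_hidden)      # appended to the Base's input sequence
assert soft_tokens.shape == (K, D_BASE)
```

In this framing, H1 corresponds to increasing K (or the projection width), while H2 corresponds to training the Base and Coprocessor parameters jointly rather than in isolation.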