A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

📅 2026-05-28
🤖 AI Summary
This work addresses the trade-off between computational depth and model capacity in looped (recurrent) Transformers under a fixed FLOPs budget, where parameter sharing limits capacity. The authors propose a dual-path architecture that places two parallel paths within a single layer: a depth path, which applies a shared sublayer K times, and a width path, which applies an enlarged feed-forward network once. Learnable per-token gates combine the two paths, decoupling depth from capacity and making the allocation interpretable at the token level. Experiments show that, across two FLOP budgets, the proposed model outperforms iso-FLOP matched models on language modeling and downstream tasks while using fewer parameters than the baseline. Gating analysis shows that function words and lexical content tend toward the width path, while punctuation, symbols, and arithmetic tokens tend toward the depth path.
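The FLOPs-versus-parameters trade-off described above can be made concrete with back-of-the-envelope arithmetic. The sketch below is not the authors' accounting. It assumes a standard block with four attention projections and a two-matrix feed-forward network, counts about 2 FLOPs per weight per application, and ignores attention-score FLOPs, biases, and norms. The wide path is assumed to be feed-forward only, and the names `sublayer_params`, `flops_per_token`, and `d_wide` are hypothetical.

```python
def sublayer_params(d_model, d_ff, with_attention=True):
    """Weight-matrix parameters of one sublayer (biases and norms ignored)."""
    attn = 4 * d_model * d_model if with_attention else 0  # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                               # up and down projections
    return attn + ffn


def flops_per_token(params, applications):
    """Roughly 2 FLOPs (multiply + add) per weight, per application."""
    return 2 * params * applications


d_model, d_ff, steps = 512, 2048, 4
block = sublayer_params(d_model, d_ff)

# Baseline: `steps` distinct blocks, each applied once.
baseline = {"params": steps * block, "flops": steps * flops_per_token(block, 1)}

# Looped: one shared block applied `steps` times.
looped = {"params": block, "flops": flops_per_token(block, steps)}

# Dual-path (assumed layout): shared block applied k times, plus a wide
# feed-forward path applied once and sized to cost one block application.
k = steps - 1
d_wide = 2 * d_model + d_ff  # makes 2 * d_model * d_wide equal to `block`
wide = sublayer_params(d_model, d_wide, with_attention=False)
dual = {
    "params": block + wide,
    "flops": flops_per_token(block, k) + flops_per_token(wide, 1),
}

for name, model in [("baseline", baseline), ("looped", looped), ("dual", dual)]:
    print(f"{name:9s} params={model['params']:>10,d} flops/token={model['flops']:>10,d}")
```

In this toy setting all three configurations spend the same FLOPs per token, while the parameter counts are one block for the looped model, two for the dual-path layer, and four for the baseline. The ordering is consistent with the summary's claim of fewer parameters than the baseline at matched FLOPs, though the actual sizes used in the paper are not given here.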
📝 Abstract
Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To this end, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
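The block structure in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: both sublayers are stood in for by position-wise ReLU MLPs (attention and normalization are omitted), the deep path uses residual updates, and the gates are taken to be one sigmoid per path computed from the block input. The abstract says only that the gates are independent and per-token, so the gate parameterization, the function name `dual_path_block`, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(d_model, d_hidden):
    """Two weight matrices for a position-wise MLP, scaled to keep activations small."""
    return {
        "w_in": rng.normal(0.0, d_model ** -0.5, (d_model, d_hidden)),
        "w_out": rng.normal(0.0, d_hidden ** -0.5, (d_hidden, d_model)),
    }


def mlp(params, x):
    """ReLU MLP applied independently to every token (row of x)."""
    return np.maximum(x @ params["w_in"], 0.0) @ params["w_out"]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dual_path_block(x, deep, wide, w_gate, k):
    """x: (tokens, d_model). Returns the fused output and the per-token gates."""
    # Deep path: the same parameters are re-applied k times.
    h = x
    for _ in range(k):
        h = h + mlp(deep, h)
    deep_out = h - x

    # Wide path: a larger MLP applied once.
    wide_out = mlp(wide, x)

    # Independent gates (assumed form): one sigmoid per path, so the two
    # values for a token need not sum to 1.
    gates = sigmoid(x @ w_gate)  # shape (tokens, 2): [deep gate, wide gate]
    y = x + gates[:, :1] * deep_out + gates[:, 1:] * wide_out
    return y, gates


d_model, k = 16, 4
x = rng.normal(size=(5, d_model))          # 5 tokens
deep = init_mlp(d_model, 32)               # small sublayer, shared across k steps
wide = init_mlp(d_model, 128)              # enlarged feed-forward network
w_gate = rng.normal(0.0, d_model ** -0.5, (d_model, 2))

y, gates = dual_path_block(x, deep, wide, w_gate, k)
print(y.shape, gates.shape)
```

Because each token gets its own pair of gate values, the `gates` array can be inspected directly, which is the kind of per-token routing analysis the abstract refers to.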
Problem

Research questions and friction points this paper is trying to address.

compute scaling
model capacity
looped transformers
parameter efficiency
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-path architecture
looped transformers
compute-capacity scaling
parameter-efficient scaling
per-token gating