A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off between sequential compute and model capacity in looped Transformers: under a fixed FLOPs budget, parameter sharing leaves a looped model with less capacity than an unshared baseline. To overcome this, the authors propose a dual-path architecture that runs two paths in parallel within a single layer: a depth path consisting of K repetitions of a shared sublayer, and a width path with an expanded feed-forward network applied once. Learnable, independent per-token gates combine the two paths, exposing compute and capacity as separate axes and making per-token routing directly interpretable. Experiments across two FLOP budgets show that the proposed model outperforms iso-FLOP matched models on language modeling and downstream tasks while using fewer parameters than the baseline at matched FLOPs. Gating analysis further shows that function words and lexical content trend toward the width path, while punctuation, symbols, and arithmetic tokens trend toward the depth path.
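The capacity gap at fixed FLOPs can be illustrated with a rough accounting sketch. It assumes the common approximation of about 2 forward-pass FLOPs per weight per token and ignores attention terms that depend on sequence length; the function name and the layer sizes are hypothetical and do not come from the paper.

```python
def iso_flop_accounting(n_layers, params_per_layer, k):
    """Compare an unshared stack with a looped stack at matched per-token FLOPs.

    Assumption: forward FLOPs per token ~= 2 * (weights applied), counting a
    shared layer once for every time it is re-applied.
    """
    if n_layers % k != 0:
        raise ValueError("n_layers must be divisible by k in this sketch")

    # Baseline: n_layers distinct layers, each applied once.
    baseline_params = n_layers * params_per_layer
    baseline_flops = 2 * baseline_params

    # Looped: n_layers / k distinct layers, each applied k times.
    unique_layers = n_layers // k
    looped_params = unique_layers * params_per_layer
    looped_flops = 2 * looped_params * k

    return {
        "baseline_params": baseline_params,
        "baseline_flops": baseline_flops,
        "looped_params": looped_params,
        "looped_flops": looped_flops,
    }


if __name__ == "__main__":
    report = iso_flop_accounting(n_layers=12, params_per_layer=1_000_000, k=4)
    for name, value in report.items():
        print(f"{name}: {value}")
```

Under these assumptions both stacks spend the same FLOPs per token, but the looped stack stores only 1/k of the baseline's parameters.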

📝 Abstract

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To this end, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
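The dual-path block described above might be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it omits attention and normalization, uses plain ReLU feed-forward sublayers, and assumes sigmoid gates that scale each path's residual update. The abstract does not specify the fusion rule, and all names (`dual_path_block`, `init_params`, `d_deep`, `d_wide`) are hypothetical.

```python
import numpy as np


def ffn(x, w_in, w_out):
    """Two-layer ReLU feed-forward network applied to each token row of x."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d_model, d_deep, d_wide, seed=0):
    """Random weights; the wide path uses a larger hidden size (d_wide > d_deep)."""
    rng = np.random.default_rng(seed)
    scale = 0.02
    return {
        "deep_in": rng.normal(0.0, scale, (d_model, d_deep)),
        "deep_out": rng.normal(0.0, scale, (d_deep, d_model)),
        "wide_in": rng.normal(0.0, scale, (d_model, d_wide)),
        "wide_out": rng.normal(0.0, scale, (d_wide, d_model)),
        "gate_w": rng.normal(0.0, scale, (d_model, 2)),
        "gate_b": np.zeros(2),
    }


def dual_path_block(x, params, k):
    """Apply one dual-path block to x of shape (tokens, d_model).

    Returns the block output and the per-token gates of shape (tokens, 2),
    where column 0 is the deep gate and column 1 is the wide gate.
    """
    # Deep path: the same sublayer (shared weights) is re-applied k times.
    h = x
    for _ in range(k):
        h = h + ffn(h, params["deep_in"], params["deep_out"])
    deep_update = h - x

    # Wide path: an enlarged feed-forward network applied once.
    wide_update = ffn(x, params["wide_in"], params["wide_out"])

    # Independent per-token gates (assumed sigmoid, so they need not sum to 1).
    gates = sigmoid(x @ params["gate_w"] + params["gate_b"])

    out = x + gates[:, :1] * deep_update + gates[:, 1:] * wide_update
    return out, gates


if __name__ == "__main__":
    tokens = np.random.default_rng(1).normal(size=(5, 16))
    params = init_params(d_model=16, d_deep=32, d_wide=128)
    out, gates = dual_path_block(tokens, params, k=3)
    print(out.shape, gates.shape)
```

Because each token gets its own pair of gate values, the returned `gates` array can be inspected directly to see which tokens lean deep and which lean wide, which is the kind of routing analysis the abstract describes.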

Problem

Research questions and friction points this paper is trying to address.

compute scaling

model capacity

looped transformers

parameter efficiency

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-path architecture

looped transformers

compute-capacity scaling