Hyperloop Transformers

📅 2026-04-22

🤖 AI Summary

This work addresses the challenge of deploying large language models in memory-constrained edge and on-device settings by proposing a parameter-efficient Transformer architecture. The model is organized into begin, middle, and end blocks, each consisting of multiple Transformer layers, and only the middle block is reused recurrently across depth. The looped middle block is augmented with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams and are applied only after each loop, so they add minimal parameters and compute. The resulting Hyperloop Transformer uses approximately 50% fewer parameters while outperforming depth-matched standard Transformer and mHC Transformer baselines. The advantage persists after post-training weight quantization, making the architecture well suited to memory-limited deployment.
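To make the parameter saving concrete, the following is a rough back-of-the-envelope sketch rather than the paper's own accounting. It assumes a common approximation of about 12·d² weights per Transformer layer (attention plus a 4x MLP) and ignores embeddings, norms, and the hyper-connection parameters. The block sizes and loop count (`n_begin=2`, `n_middle=4`, `n_end=2`, `num_loops=3`) are hypothetical values chosen for illustration; the summary does not state the actual configuration.

```python
def layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one Transformer layer (no biases or norms)."""
    attention = 4 * d_model * d_model        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up and down projections
    return attention + mlp


def count_params(d_model: int, n_begin: int, n_middle: int, n_end: int,
                 num_loops: int) -> dict:
    """Compare a looped model with a depth-matched, non-looped model."""
    per_layer = layer_params(d_model)
    # The middle block is applied num_loops times, so it contributes
    # n_middle * num_loops layers of depth but only n_middle layers of weights.
    effective_depth = n_begin + n_middle * num_loops + n_end
    unique_layers = n_begin + n_middle + n_end
    return {
        "effective_depth": effective_depth,
        "depth_matched_params": effective_depth * per_layer,
        "looped_params": unique_layers * per_layer,
    }


# Hypothetical configuration, not taken from the paper.
stats = count_params(d_model=1024, n_begin=2, n_middle=4, n_end=2, num_loops=3)
saving = 1 - stats["looped_params"] / stats["depth_matched_params"]
print(f"effective depth: {stats['effective_depth']}, saving: {saving:.0%}")
# prints "effective depth: 16, saving: 50%"
```

Under these assumptions, 8 unique layers provide an effective depth of 16, which halves the layer weights. Other splits between the three blocks would give different ratios.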

📝 Abstract

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
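The control flow described in the abstract can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: `ToyLayer` is a hypothetical stand-in for a Transformer layer, and the `read`, `mix`, and `write` weights are hypothetical fixed values standing in for whatever learned hyper-connection parameters the paper uses (the abstract does not give their exact form). The sketch only shows that the begin and end blocks run once, the middle block's weights are reused on every loop, and the multi-stream residual is updated once per loop.

```python
from typing import List

Vector = List[float]


class ToyLayer:
    """Hypothetical stand-in for one Transformer layer (no real attention/MLP)."""

    def __init__(self, scale: float) -> None:
        self.scale = scale
        self.calls = 0  # number of times this layer's weights were applied

    def __call__(self, x: Vector) -> Vector:
        self.calls += 1
        # Residual-style update x + f(x), with f(x) = scale * x as a placeholder.
        return [xi + self.scale * xi for xi in x]


def run_block(block: List[ToyLayer], x: Vector) -> Vector:
    """Apply the layers of one block in order."""
    for layer in block:
        x = layer(x)
    return x


def hyperloop_forward(x: Vector, begin: List[ToyLayer], middle: List[ToyLayer],
                      end: List[ToyLayer], num_loops: int, read: List[float],
                      mix: List[List[float]], write: List[float]) -> Vector:
    """Begin block once, middle block num_loops times, end block once."""
    n, d = len(read), len(x)
    x = run_block(begin, x)
    # Expand the residual stream into n parallel streams (an n x d matrix).
    streams = [list(x) for _ in range(n)]
    for _ in range(num_loops):
        # Read: combine the streams into one input for the shared middle block.
        h = [sum(read[i] * streams[i][k] for i in range(n)) for k in range(d)]
        out = run_block(middle, h)
        # After the loop pass: mix the streams and write the block output back.
        streams = [
            [sum(mix[i][j] * streams[j][k] for j in range(n)) + write[i] * out[k]
             for k in range(d)]
            for i in range(n)
        ]
    # Collapse the streams (a simple average, assumed here) before the end block.
    x = [sum(streams[i][k] for i in range(n)) / n for k in range(d)]
    return run_block(end, x)


# Hypothetical sizes: 1 begin layer, 2 middle layers looped 3 times, 1 end layer.
begin, middle, end = [ToyLayer(0.1)], [ToyLayer(0.1), ToyLayer(0.1)], [ToyLayer(0.1)]
y = hyperloop_forward([1.0, 2.0], begin, middle, end, num_loops=3,
                      read=[0.5, 0.5], mix=[[1.0, 0.0], [0.0, 1.0]],
                      write=[0.5, 0.5])
```

In this toy run, each middle layer is applied three times while the begin and end layers are applied once, so four sets of layer weights provide eight layer applications. Because the stream update happens once per loop rather than once per layer, its cost is small compared with the block itself.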

Problem

Research questions and friction points this paper is trying to address.

parameter-efficient

memory footprint

language modeling

edge deployment

on-device inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Transformers

Hyper-connections

Parameter-efficient LLMs