🤖 AI Summary
This work proposes a novel function class inspired by continued fractions for generative modeling, aiming to reduce the parameter count and training cost of language models. The authors introduce CoFrGeNet, a new architecture that replaces the multi-head attention and feed-forward network modules in Transformers with lightweight, plug-and-play components based on continued fraction structures—marking the first application of such structures in language modeling. Dedicated gradient formulations are derived to enhance optimization efficiency. Evaluated on large-scale models including GPT-2-xl and Llama3, the approach achieves comparable or superior performance on tasks such as classification, question answering, and reasoning, while using only one-half to two-thirds of the original model’s parameters and requiring less pretraining time.
📝 Abstract
Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations that optimize the proposed components more accurately and efficiently than standard PyTorch autograd. Our components are plug-in replacements requiring little change to the training or inference procedures already in place for Transformer-based models, making our approach easy to incorporate into large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning, and text-understanding tasks is competitive with, and sometimes even superior to, the original models while using $\frac{2}{3}$ to $\frac{1}{2}$ of the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
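To make the underlying function class concrete, here is a minimal sketch of evaluating a generalized continued fraction $f(x) = a_0(x) + b_1(x)\,/\,(a_1(x) + b_2(x)\,/\,(\cdots + b_K(x)/a_K(x)))$ by backward recurrence. The affine parameterization of the partial terms $a_k, b_k$ and the small-denominator guard are illustrative assumptions, not the paper's actual CoFrGeNet component design.

```python
import numpy as np

def continued_fraction(x, a_coeffs, b_coeffs, eps=1e-6):
    """Evaluate a depth-K generalized continued fraction by backward recurrence.

    a_coeffs: K+1 pairs (w, c) defining the partial terms a_k(x) = w*x + c
    b_coeffs: K pairs (w, c) defining the partial numerators b_k(x) = w*x + c
    (Affine terms are an illustrative choice, not the paper's parameterization.)
    """
    wa, ca = a_coeffs[-1]
    val = wa * x + ca                      # innermost term a_K(x)
    # Fold outward: val <- a_k(x) + b_{k+1}(x) / val, for k = K-1, ..., 0
    for (wa, ca), (wb, cb) in zip(reversed(a_coeffs[:-1]), reversed(b_coeffs)):
        # guard tiny denominators to keep the recurrence numerically stable
        safe = np.where(np.abs(val) < eps, eps, val)
        val = (wa * x + ca) + (wb * x + cb) / safe
    return val
```

With constant terms $a_k = b_k = 1$ this recovers the classical continued fraction for the golden ratio, which illustrates how deep nesting of cheap rational terms can express non-trivial functions with very few parameters.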