🤖 AI Summary
This work proposes a novel function class inspired by continued fractions for generative modeling, aiming to reduce the parameter count and training cost of language models. The authors introduce CoFrGeNet, a new architecture that replaces the multi-head attention and feed-forward network modules in Transformers with lightweight, plug-and-play components based on continued fraction structures—marking the first application of such structures in language modeling. Dedicated gradient formulations are derived to enhance optimization efficiency. Evaluated on large-scale models including GPT-2-xl and Llama3, the approach achieves comparable or superior performance on tasks such as classification, question answering, and reasoning, while using only one-half to two-thirds of the original model’s parameters and requiring less pretraining time.
📝 Abstract
Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations that optimize the proposed components more accurately and efficiently than standard PyTorch autograd. Our components are plug-in replacements requiring little change to the training or inference procedures already in place for Transformer-based models, making our approach easy to incorporate into large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning, and text-understanding tasks is competitive with, and sometimes even superior to, the original models while using $\frac{2}{3}$ to $\frac{1}{2}$ of the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
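To make the underlying function class concrete, here is a minimal sketch of evaluating a generalized continued fraction $f(x) = a_0(x) + b_1(x)\,/\,(a_1(x) + b_2(x)\,/\,(\cdots + b_K(x)/a_K(x)))$ by backward recurrence. The affine parameterization of the partial terms $a_k, b_k$ and the small-denominator guard are illustrative assumptions, not the paper's actual CoFrGeNet component design.

```python
import numpy as np

def continued_fraction(x, a_coeffs, b_coeffs, eps=1e-6):
    """Evaluate a depth-K generalized continued fraction by backward recurrence.

    a_coeffs: K+1 pairs (w, c) defining the partial terms a_k(x) = w*x + c
    b_coeffs: K pairs (w, c) defining the partial numerators b_k(x) = w*x + c
    (Affine terms are an illustrative choice, not the paper's parameterization.)
    """
    wa, ca = a_coeffs[-1]
    val = wa * x + ca                      # innermost term a_K(x)
    # Fold outward: val <- a_k(x) + b_{k+1}(x) / val, for k = K-1, ..., 0
    for (wa, ca), (wb, cb) in zip(reversed(a_coeffs[:-1]), reversed(b_coeffs)):
        # guard tiny denominators to keep the recurrence numerically stable
        safe = np.where(np.abs(val) < eps, eps, val)
        val = (wa * x + ca) + (wb * x + cb) / safe
    return val
```

With constant terms $a_k = b_k = 1$ this recovers the classical continued fraction for the golden ratio, which illustrates how deep nesting of cheap rational terms can express non-trivial functions with very few parameters.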