T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

📅 2024-06-27
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing subword tokenizers for large language models suffer from computational overhead, ineffective vocabulary use, and poor adaptability to underrepresented languages, leading to unnecessarily large embedding and head layers and weak cross-lingual generalization. This paper proposes T-FREE, a tokenizer-free embedding scheme for generative LLMs: it embeds words directly through sparse activation patterns over character triplets (trigrams), eliminating both the reference-corpus-dependent tokenizer training step and subword segmentation. The approach inherently exploits morphological similarity and reduces the combined parameters of the embedding and language-model head layers by more than 85% while maintaining competitive downstream performance. Experiments further show significant improvements in cross-lingual transfer to low-resource languages, indicating that sparse character-level representations achieve a strong trade-off between efficiency and generalization.

📝 Abstract
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.
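The abstract's core mechanism, embedding a word via sparse activations over its character triplets, can be sketched roughly as below. The hash function, the table size (`vocab_size`), and the number of hash probes per trigram are illustrative assumptions for this sketch, not the paper's actual design or hyperparameters:

```python
import hashlib


def word_to_trigrams(word: str) -> set[str]:
    """Decompose a whitespace-padded word into its character trigrams."""
    padded = f" {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def sparse_activations(word: str, vocab_size: int = 8192,
                       num_hashes: int = 2) -> set[int]:
    """Map each trigram to a few rows of the embedding matrix via hashing.

    The word's dense embedding would then be the aggregate (e.g. sum) of the
    activated rows; `vocab_size` and `num_hashes` are hypothetical values.
    """
    indices = set()
    for tri in word_to_trigrams(word):
        for k in range(num_hashes):
            digest = hashlib.md5(f"{k}:{tri}".encode()).hexdigest()
            indices.add(int(digest, 16) % vocab_size)
    return indices


# Morphologically similar words share trigrams and thus activation rows,
# which is how the scheme exploits morphological similarity:
a = sparse_activations("token")
b = sparse_activations("tokens")
shared = a & b  # non-empty: "token"/"tokens" share trigrams like "tok"
```

Because the word-to-indices map is a fixed function of the characters, no reference corpus or trained tokenizer is needed, and the embedding table can stay much smaller than a subword vocabulary.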
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Tokenization Efficiency
Resource Consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

T-FREE
Memory-Efficient Language Processing
Cross-Linguistic Learning
👥 Authors
Björn Deiseroth (TU Darmstadt)
Manuel Brack (Applied Research Scientist @ Adobe; Adjunct Researcher @ hessian.AI)
P. Schramowski (Technical University Darmstadt; Hessian Center for Artificial Intelligence (hessian.AI); German Research Center for Artificial Intelligence (DFKI))
K. Kersting (Technical University Darmstadt; Hessian Center for Artificial Intelligence (hessian.AI); German Research Center for Artificial Intelligence (DFKI))
Samuel Weinbach (Aleph Alpha @ IPAI)