🤖 AI Summary
This work addresses two problems in large language models, the under-training of rare tokens and contextual collapse, both of which arise because token identity is injected only once, at the input embedding layer. To mitigate them, the authors propose TIDE, an architecture that augments the standard Transformer with an EmbeddingMemory mechanism. EmbeddingMemory consists of K independent MemoryBlocks that map token indices to context-free vectors; these vectors are computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank, so the original token identity stays available throughout the network. The authors report that TIDE alleviates both problems and improves performance across language modeling benchmarks and downstream tasks.
📝 Abstract
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, because they receive only a small fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where models with limited parameters map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We establish, theoretically and empirically, that TIDE mitigates the issues caused by injecting token identity only once and improves performance across multiple language modeling and downstream tasks.
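The EmbeddingMemory idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The class name `EmbeddingMemorySketch`, its methods, and all sizes are hypothetical, and several details are assumptions the abstract does not confirm: each MemoryBlock is reduced to a plain lookup table, the null bank is a single shared vector, the router depends only on the layer index, and the mixed vector is added to the hidden state.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemorySketch:
    """Hypothetical illustration: K context-free tables plus a null slot,
    mixed with layer-specific softmax weights."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        # K independent blocks, each simplified here to a lookup table (assumption).
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # Null bank simplified to one shared vector (assumption).
        self.null = [0.0] * dim
        # One row of router logits per layer: K block slots plus one null slot.
        self.router_logits = [
            [rng.gauss(0.0, 1.0) for _ in range(num_blocks + 1)]
            for _ in range(num_layers)
        ]

    def lookup(self, token_ids):
        """Computed once per sequence: result[k][pos] is block k's vector."""
        return [[block[t] for t in token_ids] for block in self.blocks]

    def inject(self, hidden, memory, layer):
        """Add the layer-weighted mix of block vectors and the null vector."""
        weights = softmax(self.router_logits[layer])
        out = []
        for pos, h in enumerate(hidden):
            candidates = [memory[k][pos] for k in range(len(self.blocks))]
            candidates.append(self.null)
            mixed = [
                sum(w * c[d] for w, c in zip(weights, candidates))
                for d in range(len(h))
            ]
            out.append([hd + md for hd, md in zip(h, mixed)])
        return out


# Usage: look up once, then inject the same memory at every layer.
mem = EmbeddingMemorySketch(vocab_size=10, dim=4, num_blocks=3, num_layers=2)
token_ids = [1, 7, 7]
memory = mem.lookup(token_ids)
hidden = [[0.0] * 4 for _ in token_ids]  # stand-in for transformer hidden states
for layer in range(2):
    hidden = mem.inject(hidden, memory, layer)
```

In this sketch the lookup happens once and only the routing weights change with depth, which mirrors the "computed once and injected into every layer" description. Weight assigned to the null slot lets a layer reduce how much token identity it takes in, which is one plausible reading of the null bank's role.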