🤖 AI Summary
Large language models (LLMs) can perform in-context learning (ICL), as can humans and non-human animals, yet their architectures differ substantially from biological neural systems. Method: Building on the resemblance between the attention mechanism and modern associative memory models from computational neuroscience, we introduce an associative memory model capable of ICL and use it as inspiration for a novel residual stream architecture in which information flows directly between attention heads. Contribution/Results: In a two-layer Transformer, ICL abilities emerge more quickly during training with this modification than without it. Applied to small language models with 8 million parameters, with the modification focused on attention head values, the architecture also shows improved ICL performance at this larger and more naturalistic scale.
📝 Abstract
Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to respond appropriately to data unseen during training. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, which are widely used in, and influenced by, the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to flow directly between attention heads. We test this architecture during training within a two-layer Transformer and show that its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.
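The resemblance between attention and associative memory mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's model or its residual stream architecture; it only shows the standard observation that single-query softmax attention has the same form as a retrieval step in a modern associative memory: a cue is compared against stored keys, and the output is a similarity-weighted blend of the stored values. The function names, the `beta` sharpness parameter, and the toy data are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, keys, values, beta=1.0):
    """One associative-memory retrieval step.

    Identical in form to single-query softmax attention:
    output = sum_i softmax(beta * <query, key_i>) * value_i.
    `beta` (hypothetical name) controls how sharply the best match wins.
    """
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy memory: three stored key/value pairs (illustrative data only).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A noisy cue that is closest to the second key.
query = [0.1, 0.9, 0.0]

# Large beta: retrieval is nearly one-hot, recalling the second value.
recalled = retrieve(query, keys, values, beta=20.0)

# beta = 0: every memory is weighted equally, so the output is the mean value.
blended = retrieve(query, keys, values, beta=0.0)
```

With a large `beta` the step behaves like a lookup of the best-matching stored pattern, while a small `beta` averages over memories. In a Transformer the usual counterpart of `beta` is the 1/sqrt(d) scaling of the attention scores.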