🤖 AI Summary
This work addresses the BabyLM 2025 low-resource language modeling challenge, where data scarcity severely limits model performance. Method: We propose BLaLM, a sample-efficient architecture that replaces standard self-attention with a linear-time mLSTM token mixer and integrates lightweight enhancements: sliding-window attention, dynamic modulation, and Hedgehog feature maps. Short convolutions strengthen local context modeling, pedagogically curated high-readability corpora provide effective supervision, and the Muon optimizer stabilizes training for small-scale models. Contribution/Results: Experiments demonstrate that BLaLM achieves significantly lower perplexity under strict data constraints and consistently outperforms baselines in zero-shot generalization. The results validate a viable alternative to parameter- and data-hungry paradigms, establishing a reproducible architectural blueprint and training methodology for efficient language modeling in resource-constrained settings.
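To make the "linear-time token mixer" idea concrete, here is a minimal numpy sketch of causal linear attention with a generic feature map. This is an illustration of the general technique, not BLaLM's actual mLSTM cell or Hedgehog map: the feature map below (shifted ReLU) is a placeholder assumption, and the function name `linear_attention` is hypothetical. The point is that replacing softmax(QKᵀ) with φ(Q)φ(K)ᵀ lets attention be computed as a running sum in O(n·d²) rather than O(n²·d).

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention: out_t = φ(q_t) S_t / (φ(q_t) z_t).

    S_t and z_t are running sums over positions s <= t, so the whole
    sequence is processed in a single O(n) pass (cf. softmax attention,
    which needs the full n x n score matrix).
    """
    phi_q, phi_k = feature_map(Q), feature_map(K)
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(φ(k_s), v_s)
    z = np.zeros(d)                 # running sum of φ(k_s), for normalization
    out = np.zeros_like(V)
    for t in range(n):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-6)
    return out
```

Because the state (S, z) is a fixed-size summary of the past, this recurrent form is what makes such mixers attractive under tight data and compute budgets.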
📝 Abstract
We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
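The sliding window attention mentioned above can be sketched as standard softmax attention with a banded causal mask, so each token attends only to its last `window` positions. This is a generic illustration of the mechanism, not the paper's implementation; the function name and the `window` parameter are assumptions for the sketch, and the dynamic-modulation component is omitted.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Causal attention restricted to a local window of `window` tokens."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Banded causal mask: position i sees positions [i - window + 1, i].
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window + 1):i + 1] = True
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax; each row has at least one finite entry (itself).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

The mask keeps the per-token cost O(window·d), which is why pairing such a local mixer with a linear-time global mixer (as the abstract describes) retains overall sub-quadratic cost.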