EmbBERT-Q: Breaking Memory Barriers in Embedded NLP

📅 2025-02-14
🤖 AI Summary
To address the challenge of deploying large language models on memory-constrained microdevices—such as wearables and IoT endpoints—this paper proposes EmbBERT-Q, an ultra-lightweight NLP model. Methodologically, it introduces: (i) a compact Transformer architecture specifically designed for embedded deployment; (ii) a hardware-aware 8-bit quantization scheme jointly applied to weights and activations; and (iii) the TinyNLP benchmark, a resource-efficient evaluation suite, with results additionally validated on GLUE for cross-benchmark robustness. EmbBERT-Q achieves state-of-the-art performance on both TinyNLP and GLUE while requiring only 781 kB of memory—25× smaller than the current best embedded NLP models. Under a strict 2 MB memory budget, it further surpasses compressed variants of BERT and Mamba in accuracy. This work bridges the gap between foundation-model capabilities and practical deployment on ultra-low-resource devices.
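The 8-bit quantization applied to weights and activations can be illustrated with a generic symmetric per-tensor scheme. This is a minimal sketch of the general technique, not the authors' exact implementation; the function names and the choice of symmetric per-tensor scaling are assumptions for illustration:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Quantize a toy weight tensor and measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Storing `q` instead of `w` cuts memory 4× versus float32, and the rounding error is bounded by half a quantization step (`s / 2`), which is why 8-bit schemes typically cost little accuracy.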

📝 Abstract
Large Language Models (LLMs) have revolutionized natural language processing, setting new standards across a wide range of applications. However, their substantial memory and computational demands make them impractical for deployment on technologically constrained tiny devices such as wearable devices and Internet-of-Things units. To address this limitation, we introduce EmbBERT-Q, a novel tiny language model specifically designed for devices with stringent memory constraints. EmbBERT-Q achieves state-of-the-art (SotA) accuracy on Natural Language Processing tasks in this scenario, with a total memory footprint (weights and activations) of just 781 kB, representing a 25x reduction in size with respect to SotA models. By combining architectural innovations with hardware-compatible 8-bit quantization, EmbBERT-Q consistently outperforms several baseline models scaled down to a 2 MB memory budget (i.e., the maximum memory typically available in tiny devices), including heavily compressed versions of BERT and MAMBA. Extensive experimental evaluations on both TinyNLP, a benchmark specifically curated to evaluate tiny language models on NLP tasks and real-world scenarios, and the GLUE benchmark demonstrate EmbBERT-Q's ability to deliver competitive accuracy with respect to existing approaches, achieving an unmatched balance between memory and performance. To ensure the complete and immediate reproducibility of all our results, we release all code, scripts, and model checkpoints at https://github.com/RiccardoBravin/tiny-LLM.
Problem

Research questions and friction points this paper is trying to address.

Reduces memory footprint for tiny devices
Achieves high accuracy with minimal memory
Optimizes NLP models for IoT and wearables
Innovation

Methods, ideas, or system contributions that make the work stand out.

EmbBERT-Q for tiny devices
8-bit quantization technique
781 kB memory footprint