TOM: A Ternary Read-only Memory Accelerator for LLM-powered Edge Intelligence

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large language models on edge devices, where the "memory wall" severely limits the trade-off among model scale, inference speed, and on-device adaptability. To overcome this, the authors propose a co-design approach that integrates ternary quantization, standard-cell-based logic ROM, and QLoRA fine-tuning within a hybrid ROM-SRAM architecture. The design incorporates sparsity-aware storage, distributed processing, and dynamic power gating to enhance efficiency. Notably, it achieves the first practical support for high-density, tunable ternary weights in logic ROM, enabling a BitNet-2B implementation that delivers 3,306 tokens per second (TPS) inference throughput. This significantly improves real-time performance, energy efficiency, and local adaptation capabilities for edge deployment.

📝 Abstract
The deployment of Large Language Models (LLMs) for real-time intelligence on edge devices is rapidly growing. However, conventional hardware architectures face a fundamental memory wall challenge, where limited on-device memory capacity and bandwidth severely constrain the size of deployable models and their inference speed, while also limiting on-device adaptation. To address this challenge, we propose TOM, a hybrid ROM-SRAM accelerator co-designed with ternary quantization, which balances extreme density with on-device tunability. TOM exploits the synergy between ternary quantization and ROM to achieve extreme memory density and bandwidth, while preserving flexibility through a hybrid ROM-SRAM architecture designed for QLoRA-based tunability. Specifically, we introduce: (1) a sparsity-aware ROM architecture that synthesizes ternary weights as standard-cell logic, eliminating area overhead from zero-valued bits; (2) a distributed processing architecture that co-locates high-density ROM banks with flexible SRAM-based QLoRA adapters and compute units; and (3) a workload-aware dynamic power gating scheme that exploits the logic-based nature of ROM to power down inactive banks, minimizing dynamic energy consumption. TOM achieves an inference throughput of 3,306 TPS on the BitNet-2B model, demonstrating its effectiveness in delivering real-time, energy-efficient edge intelligence.
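To make the ternary-quantization idea concrete: BitNet-style models constrain each weight to {-1, 0, +1} plus a shared scale, which is what lets the weights be baked into dense logic ROM (and what makes zero-valued weights free to skip, as in the paper's sparsity-aware storage). The sketch below shows a per-tensor absmean ternary quantizer; it is a minimal illustration of the general scheme, not the paper's exact procedure, and the function name is our own.

```python
def ternary_quantize(weights):
    """Quantize a list of float weights to ternary codes in {-1, 0, +1}
    plus a shared scale (absmean scheme, BitNet-style; illustrative sketch)."""
    # Per-tensor scale: mean absolute value (guard against an all-zero tensor).
    scale = sum(abs(w) for w in weights) / len(weights) or 1e-8
    # Round to the nearest ternary level and clip to [-1, +1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

codes, scale = ternary_quantize([0.8, -0.05, 1.2, -0.9, 0.02, 0.5])
# codes == [1, 0, 1, -1, 0, 1]; small weights collapse to 0,
# which a sparsity-aware ROM layout can store at zero area cost.
```

Dequantization is simply `code * scale`, so a matrix-vector product over ternary weights reduces to additions, subtractions, and skips, which is the property the ROM co-design exploits.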
Problem

Research questions and friction points this paper is trying to address.

memory wall
edge intelligence
Large Language Models
on-device adaptation
memory bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary Quantization
ROM-SRAM Hybrid Architecture
QLoRA
Edge AI Accelerator
Memory Wall