🤖 AI Summary
This work addresses the memory bandwidth bottleneck that limits the energy efficiency and throughput of autoregressive inference with small language models (SLMs) on edge devices. To overcome this challenge, the authors propose EdgeCIM, the first framework integrating compute-in-memory (CIM) with holistic hardware-software co-design. By leveraging a 65nm CIM macro, INT4 quantization, and a tiling-aware weight mapping strategy, EdgeCIM optimizes the end-to-end decoding pipeline to alleviate DRAM bottlenecks. On LLaMA3.2-1B, EdgeCIM achieves up to 7.3× higher throughput and 49.59× better energy efficiency than an NVIDIA Orin Nano, and on LLaMA3.2-3B it delivers 9.95× higher throughput than the Qualcomm SA8255P. Averaged across the benchmarked models under INT4 precision, it reaches 336.42 tokens/s and 173.02 tokens/J. The accompanying simulator also enables design space exploration of models up to 4 billion parameters within the edge deployment regime.
📝 Abstract
The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.
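The claim that decoding is memory-bound can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function names, the 2048×2048 layer shape, the 1-byte activations, and the 50 GB/s bandwidth figure are all illustrative assumptions, and the model ignores KV-cache and other non-weight traffic. It compares a single-token GEMV with a multi-token GEMM over the same INT4 weight matrix, then derives the bandwidth-limited decode rate when every weight must be streamed once per token.

```python
def arithmetic_intensity(rows: int, cols: int, tokens: int = 1,
                         weight_bits: int = 4, act_bytes: int = 1) -> float:
    """Operations per byte moved for Y = W @ X, with W of shape (rows, cols).

    tokens=1 is the GEMV case seen in autoregressive decoding;
    tokens>1 approximates a prefill-style GEMM. The 1-byte activation
    size is an assumption made for illustration only.
    """
    ops = 2 * rows * cols * tokens              # one multiply and one add per weight per token
    weight_bytes = rows * cols * weight_bits / 8
    io_bytes = (rows + cols) * tokens * act_bytes  # input and output vectors
    return ops / (weight_bytes + io_bytes)


def bandwidth_bound_tokens_per_s(n_params: float, weight_bits: int,
                                 bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode rate if all weights are read once per token."""
    bytes_per_token = n_params * weight_bits / 8
    return bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # Hypothetical 2048 x 2048 projection layer with INT4 weights.
    print("GEMV ops/byte:", arithmetic_intensity(2048, 2048, tokens=1))
    print("GEMM ops/byte:", arithmetic_intensity(2048, 2048, tokens=512))
    # Hypothetical 1B-parameter model and 50 GB/s of DRAM bandwidth;
    # neither number describes any specific device.
    print("tokens/s bound:", bandwidth_bound_tokens_per_s(1e9, 4, 50e9))
```

Under these assumptions the GEMV performs just under 4 operations per byte while the 512-token GEMM performs about 1024, which is why a conventional accelerator can sit idle waiting on DRAM during decoding. Keeping weights inside a CIM macro is one way to avoid streaming them for every token.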