Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Irregular embedding lookups in recommender systems, sparse large language models, and graph learning are a critical performance bottleneck on conventional hardware. Method: This paper first shows that Decoupled Access-Execute (DAE) processors, which offload these lookups to specialized access units, outperform GPUs on end-to-end models. It then introduces Ember, a compiler that automatically generates optimized DAE code from PyTorch and TensorFlow using multiple intermediate representations (IRs), each designed for a different optimization level. Contribution/Results: Ember implements all the optimizations needed to match the performance of hand-written code. On end-to-end models, DAE processors deliver 2.6× higher performance and 6.4× higher performance per watt than GPUs.

📝 Abstract
Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
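To make the access/execute split concrete, here is a minimal pure-Python sketch of a sum-pooled embedding lookup divided into an "access" stage that performs the irregular gathers and an "execute" stage that reduces the gathered rows. It is only an illustration of the general idea and does not reflect Ember's IRs or generated code. The function names, the queue between the two stages, and the offsets convention (the start position of each bag in the flat index list) are all assumptions made for this example. On real DAE hardware the two stages would run concurrently; here they run one after the other.

```python
from collections import deque


def access_unit(table, indices, offsets, queue):
    """Access stage (hypothetical): follow the irregular indices and
    push each gathered embedding row, tagged with its bag id, into a queue."""
    bounds = list(offsets) + [len(indices)]
    for bag in range(len(offsets)):
        for i in indices[bounds[bag]:bounds[bag + 1]]:
            queue.append((bag, table[i]))


def execute_unit(queue, num_bags, dim):
    """Execute stage (hypothetical): consume rows in arrival order and
    sum-pool them per bag. It never computes a table address itself."""
    out = [[0.0] * dim for _ in range(num_bags)]
    while queue:
        bag, row = queue.popleft()
        for d in range(dim):
            out[bag][d] += row[d]
    return out


def embedding_bag_sum(table, indices, offsets):
    """Run the access stage, then the execute stage, over one batch."""
    queue = deque()
    access_unit(table, indices, offsets, queue)
    return execute_unit(queue, len(offsets), len(table[0]))


# Four embedding rows of dimension 2; two bags: rows {0, 2} and rows {3, 1}.
table = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
pooled = embedding_bag_sum(table, indices=[0, 2, 3, 1], offsets=[0, 2])
print(pooled)
```

The point of the split is that only `access_unit` touches the table with data-dependent addresses, so that part can be offloaded to a specialized unit while the reduction stays a regular streaming computation.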
Problem

Research questions and friction points this paper is trying to address.

Addressing irregular embedding lookups in recommender and graph models
Improving performance and efficiency via Decoupled Access-Execute processors
Automating optimized code generation from PyTorch/TensorFlow for DAE architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized access units for embedding lookups
Ember compiler for DAE code generation
Multiple intermediate representations for optimization
Authors

Marco Siracusa
Barcelona Supercomputing Center
Compiler Technologies, Computer Architectures, High-Performance Computing, Machine Learning
Olivia Hsu
Stanford University
Compilers, Computer Architecture, Sparse Computation, Programming Languages, VLSI
Víctor Soria-Pardos
Barcelona Supercomputing Center
Joshua Randall
Arm
Arnaud Grasset
Arm
Eric Biscondi
Arm
Doug Joseph
Arm
Randy Allen
Barcelona Supercomputing Center
Fredrik Kjolstad
Assistant Professor, Stanford University
Compilers, Programming Languages, Sparse Computation, Performance Engineering
Miquel Moretó Planas
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya
Adria Armejach
Barcelona Supercomputing Center, Universitat Politècnica de Catalunya