FastOmniTMAE: Parallel Clause Learning for Scalable and Hardware-Efficient Tsetlin Embeddings

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the slow training and low GPU efficiency of static embedding models based on the Tsetlin Machine (TM). It proposes FastOmniTMAE, a reformulation of Omni TM-AE that replaces the sequential training dependencies with a two-stage parallel process: evaluation, then update. The method trains up to 5× faster on classification tasks while keeping embedding quality comparable to the original approach under Spearman and Kendall similarity measures. The authors also implement FastOmniTMAE as a reusable accelerator for SoC-FPGA platforms, where it reaches similarity scores of 0.669 on a resource-constrained FPGA and 0.696 on an UltraScale+ SoC.
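The evaluate-then-update decoupling can be illustrated with a toy sketch. This is not the paper's algorithm: the feedback rule below is a deterministic simplification of Tsetlin Machine Type I/II feedback, the autoencoder objective is omitted, and every name (`N_STATES`, `evaluate`, `update`) is hypothetical. Threads stand in for parallel hardware lanes, so no real speedup should be expected from this Python version.

```python
from concurrent.futures import ThreadPoolExecutor

N_STATES = 4  # assumption: a literal is included when its automaton state > N_STATES


def literals(x):
    # A clause sees each Boolean feature followed by its negation.
    return x + [1 - b for b in x]


def clause_output(states, x):
    # A clause is the AND of its included literals (an empty clause outputs 1).
    lits = literals(x)
    return int(all(lits[k] for k, s in enumerate(states) if s > N_STATES))


def evaluate(clauses, batch):
    # Stage 1: read-only pass over a frozen snapshot of all automaton states.
    # Clauses do not depend on each other here, so they can run in parallel.
    def one(states):
        return [clause_output(states, x) for x in batch]

    with ThreadPoolExecutor() as pool:
        return list(pool.map(one, clauses))


def update(clauses, batch, targets, outputs):
    # Stage 2: apply accumulated feedback using only the stage-1 outputs.
    new_clauses = []
    for states, outs in zip(clauses, outputs):
        delta = [0] * len(states)
        for x, y, out in zip(batch, targets, outs):
            for k, lit in enumerate(literals(x)):
                if y == 1 and out == 1:
                    # Simplified Type I: reward true literals, penalize false ones.
                    delta[k] += 1 if lit else -1
                elif y == 0 and out == 1 and lit == 0 and states[k] <= N_STATES:
                    # Simplified Type II: push a false literal toward inclusion.
                    delta[k] += 1
        new_clauses.append(
            [min(2 * N_STATES, max(1, s + d)) for s, d in zip(states, delta)]
        )
    return new_clauses


# Hypothetical data: two features, target y = x0, one clause with nothing included.
clauses = [[3, 3, 3, 3]]
batch = [[1, 0], [1, 1], [0, 1]]
targets = [1, 1, 0]

outputs = evaluate(clauses, batch)                    # [[1, 1, 1]]
trained = update(clauses, batch, targets, outputs)    # [[6, 3, 1, 4]]
print(evaluate(trained, batch))                       # [[1, 1, 0]]
```

Because stage 1 never writes to the automaton states, the order in which clauses are evaluated does not matter, which is the property that makes the stage amenable to parallel execution.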

📝 Abstract

Embedding models in natural language processing (NLP) increasingly rely on deep architectures such as BERT, while simpler models such as Word2Vec provide efficient representations but limited interpretability. The Tsetlin Machine (TM) offers an alternative logic-based learning paradigm. Omni TM Autoencoder (Omni TM-AE) applies this paradigm to static embedding by exploiting automaton state distributions within a single clause layer, but its training process remains slow. In this work, we propose FastOmniTMAE, a reformulation of Omni TM-AE that replaces sequential training dependencies with a two-stage parallel process: evaluation and update. Using a Single-Run Multi-Environment Benchmark covering classification, similarity, and clustering, FastOmniTMAE achieves up to 5× faster training in classification while maintaining comparable embedding quality under both Spearman and Kendall similarity measures. To address the limited efficiency of TM training on conventional GPUs, we further implement FastOmniTMAE as a reusable accelerator on SoC-FPGA platforms. The Multi-Hardware Benchmark shows that FastOmniTMAE achieves similarity scores of 0.669 on a resource-constrained FPGA and 0.696 on an UltraScale+ SoC, demonstrating efficient logic-based embedding training with a small hardware footprint.
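For readers unfamiliar with the two similarity measures named above, the following is a minimal pure-Python sketch of Spearman and Kendall rank correlation. It assumes, as is common for word-similarity benchmarks but not confirmed by the abstract, that model similarities are scored against reference judgments. The data and function names are hypothetical, and the Kendall variant shown is tau-a, which does not correct for ties.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    # 1-based ranks; tied values share the mean of the positions they occupy.
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks(a), ranks(b))


def kendall_tau_a(a, b):
    # (concordant pairs - discordant pairs) / total pairs; ties count as zero.
    score = 0
    for i, j in combinations(range(len(a)), 2):
        prod = (a[i] - a[j]) * (b[i] - b[j])
        score += (prod > 0) - (prod < 0)
    n = len(a)
    return score / (n * (n - 1) / 2)


# Hypothetical reference judgments and model similarities for four word pairs.
reference = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.9, 0.3, 0.1]

rho = spearman(reference, model)        # 0.8
tau = kendall_tau_a(reference, model)   # 4/6, about 0.667
```

In this example the model swaps the order of the top two pairs, which costs one discordant pair out of six. Kendall's measure drops further than Spearman's for the same single swap, which is one reason to report both.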

Problem

Research questions and friction points this paper is trying to address.

Tsetlin Machine

embedding

training efficiency

hardware efficiency

static embedding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tsetlin Machine

parallel clause learning

hardware-efficient embedding