🤖 AI Summary
This work addresses the challenge of improving the semantic generalization of symbolic models without compromising their interpretability. It proposes a purely symbolic approach that uses large language models (LLMs) only offline: given a class label, an LLM generates sub-intents that guide the synthesis of training data, so no embeddings or runtime LLM calls are needed. The synthetic data are produced through a three-stage curriculum (seed, core, enriched), and a Non-Negated Tsetlin Machine (NTM) trained on them yields high-confidence literals that serve as interpretable semantic cues. These cues are then injected into real data so that a Tsetlin Machine (TM) trained on it aligns its clause logic with LLM-inferred semantics. The authors describe this as the first framework to fully symbolize LLM semantic priors and integrate them into the Tsetlin Machine. Evaluated on multiple text classification benchmarks, the approach improves accuracy and interpretability over the vanilla TM, reaching performance comparable to BERT while remaining fully symbolic and computationally efficient.
📝 Abstract
Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.
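The cue extraction and injection steps described in the abstract can be pictured with a minimal sketch. The abstract does not specify how literal confidence is measured or how cues are attached to real examples, so everything here is an assumption made for illustration: the per-class score dictionary, the fixed `threshold`, the `CUE_<label>` marker tokens, and the helper names `select_cues` and `inject_cues` are all hypothetical and are not the authors' implementation.

```python
from typing import Dict, List, Set


def select_cues(
    literal_scores: Dict[str, Dict[str, float]], threshold: float
) -> Dict[str, Set[str]]:
    """Keep, per class, the literals whose (assumed) confidence score
    meets the threshold. Stands in for reading high-confidence literals
    off a trained NTM."""
    return {
        label: {tok for tok, score in scores.items() if score >= threshold}
        for label, scores in literal_scores.items()
    }


def inject_cues(tokens: List[str], cues: Dict[str, Set[str]]) -> List[str]:
    """Append one symbolic marker token for each class whose cue set
    overlaps the document's tokens. The original tokens are left intact,
    so the augmented input stays purely symbolic."""
    present = set(tokens)
    markers = [f"CUE_{label}" for label in sorted(cues) if present & cues[label]]
    return tokens + markers


# Hypothetical scores, as if derived from an NTM trained on synthetic data.
literal_scores = {
    "sports": {"match": 0.9, "team": 0.8, "the": 0.2},
    "finance": {"stock": 0.95, "the": 0.3},
}
cues = select_cues(literal_scores, threshold=0.5)
augmented = inject_cues(["the", "team", "won"], cues)
print(augmented)
```

In this sketch the augmented token list would be binarized as usual before training the downstream TM. The marker tokens become ordinary literals that clauses can include, which is one plausible way for clause logic to pick up the LLM-derived semantics without any embedding.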