Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

📅 2026-05-21

🤖 AI Summary

This work evaluates a machine learning approach grounded in the subdirect decomposition of algebraic structures. Its generic algebraic inductive bias requires no modality-specific design, task-dependent hyperparameters, cross-validation, or numerical optimization. On small to medium-sized datasets (50–2000 training examples), where symbolic methods are generally not considered able to match strong modern baselines, the method is competitive: it outperforms cross-validated baselines, including convolutional neural networks (CNNs), on image classification, and on tabular data, where XGBoost performs best overall, it achieves results comparable to LightGBM and random forests. These findings suggest that algebraic machine learning can generalize effectively in low-data regimes without conventional optimization or hyperparameter tuning.

📝 Abstract

Symbolic methods are generally not considered competitive with strong modern learners on realistic supervised tasks. We evaluate Algebraic Machine Learning (AML), a framework that learns through subdirect decomposition of algebraic structure rather than numerical optimization, against standard baselines on image and tabular classification across varying training-set sizes. We find that AML, trained only on the training data with no validation or cross-validation, outperforms a family of cross-validated baseline methods, including CNNs, on small to medium image datasets (50–2000 training examples). On tabular datasets in the same size range, XGBoost is the best-performing method overall, but AML is nonetheless comparable to methods incorporating task-specific biases, such as LightGBM and random forests. AML achieves this competitive performance across two very different types of datasets using a generic algebraic inductive bias, rather than the modality-specific biases built into standard baselines like CNNs for images or XGBoost for tabular data, and it requires no cross-validation because it has no task-dependent hyperparameters to tune.
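
The abstract contrasts baselines whose hyperparameters are chosen by cross-validation with a method that has nothing to tune, across several training-set sizes. As a rough illustration of the baseline side of such a protocol only, the sketch below selects a hyperparameter by k-fold cross-validation on the training set at each size and then scores on a held-out test set. Everything in it is an assumption made for illustration: the k-nearest-neighbour classifier, the synthetic two-class data, the sizes, and all function names are hypothetical and are not taken from the paper, and the sketch does not implement AML.

```python
import random

def make_data(n, rng):
    """Two noisy 2-D Gaussian classes (synthetic, for illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (k is odd)."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test points classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def select_k_by_cv(train, candidates, folds=5):
    """Pick k by k-fold cross-validation using the training set only."""
    def cv_score(k):
        scores = []
        for f in range(folds):
            held_out = train[f::folds]
            rest = [p for i, p in enumerate(train) if i % folds != f]
            scores.append(accuracy(rest, held_out, k))
        return sum(scores) / folds
    return max(candidates, key=cv_score)

def learning_curve(sizes, candidates=(1, 3, 5, 9), seed=0):
    """For each training-set size: tune k by CV, then score on the test set."""
    rng = random.Random(seed)
    test = make_data(300, rng)          # fixed held-out test set
    pool = make_data(max(sizes), rng)   # training pool to subsample from
    results = {}
    for n in sizes:
        train = pool[:n]
        k = select_k_by_cv(train, candidates)
        results[n] = (k, accuracy(train, test, k))
    return results

if __name__ == "__main__":
    for n, (k, acc) in learning_curve([50, 100, 200]).items():
        print(f"n={n:4d}  selected k={k}  test accuracy={acc:.3f}")
```

A method with no task-dependent hyperparameters would skip the `select_k_by_cv` step entirely and fit once on the full training subset, which is the asymmetry the abstract points to.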

Problem

Research questions and friction points this paper is trying to address.

Algebraic Machine Learning

small-to-medium datasets

symbolic methods

supervised classification

inductive bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

Algebraic Machine Learning

subdirect decomposition

inductive bias