Discovering Association Rules in High-Dimensional Small Tabular Data

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Association rule mining in high-dimensional, low-sample tabular data (e.g., gene expression data with around 18k features and 50 samples) suffers from combinatorial rule explosion and prohibitive computational cost, and neurosymbolic methods such as Aerial+ lose performance when data is scarce. Method: This work introduces association rule mining in this regime as a new problem and proposes two fine-tuning approaches for Aerial+ that use tabular foundation models. Contribution/Results: Experiments on five real-world datasets show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines, and that the two fine-tuning approaches significantly improve rule quality in high-dimensional, low-data settings.

📝 Abstract

Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.
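To give a rough sense of the rule explosion the abstract refers to, the sketch below counts candidate itemsets by size. It is an illustration only, not the paper's method: the function name `candidate_itemsets` is hypothetical, and it assumes each feature contributes a single binary item, which understates the search space for discretized features with several values each.

```python
from math import comb


def candidate_itemsets(n_items: int, max_size: int) -> int:
    """Count all itemsets of size 1..max_size over n_items items.

    Assumption: one binary item per feature (a lower bound on the
    real search space after discretization).
    """
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


# Roughly the gene expression setting named in the abstract.
n_features = 18_000

for size in (1, 2, 3):
    total = candidate_itemsets(n_features, size)
    print(f"itemsets up to size {size}: {total:,}")
```

Even when limited to itemsets of three items, the count is far larger than the 50 available samples, which is why exhaustive algorithmic search needs aggressive search space reduction in this regime.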

Problem

Research questions and friction points this paper is trying to address.

Addressing rule explosion and computational overhead in high-dimensional association rule mining

Solving association rule mining challenges in high-dimensional, low-data tabular settings

Improving rule quality for small tabular data with many features using fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows empirically that Aerial+ scales one to two orders of magnitude better than algorithmic and neurosymbolic baselines

Introduces the problem of ARM in high-dimensional, low-data settings

Proposes two fine-tuning approaches to Aerial+ using tabular foundation models

🔎 Similar Papers

TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Node Features

2024-09-22 · arXiv.org · Citations: 2

Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later

2024-07-03 · Citations: 5
