🤖 AI Summary
To address the poor interpretability and high computational cost of black-box models on low-resource tabular data, this paper proposes a decision tree generation paradigm that leverages the structured reasoning capabilities of large language models (LLMs). Methodologically, the authors design a lightweight toolset that lets the LLM act as an intelligent agent, combining domain priors with data-driven learning to construct editable and auditable decision trees; human-in-the-loop intervention is supported for bias correction and domain-knowledge injection, with explicit, inspectable reasoning traces generated throughout. The key contribution is the first systematic integration of LLMs' structured reasoning into decision tree induction, achieving a balanced trade-off among predictive performance, interpretability, and controllability. Experiments show that the approach significantly outperforms CART in low-resource settings and remains competitive with, though slightly behind, state-of-the-art black-box models. Crucially, it yields lightweight, fully transparent, production-deployable decision tree models.
📝 Abstract
Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to run at inference time. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. Using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black-box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the LLM's reasoning-based construction process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
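
To make the agentic setup concrete, below is a minimal, hypothetical sketch of what such a decision-tree toolset could look like. The tree representation and the tool names (`split_node`, `set_leaf`, `evaluate`) are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a minimal decision-tree toolset an LLM agent
# could call. Names and structure are illustrative assumptions only.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature to split on
    threshold: Optional[float] = None  # split threshold (go left if <=)
    label: Optional[int] = None        # class label if this node is a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def split_node(node: Node, feature: int, threshold: float) -> Node:
    """Tool: turn a leaf into an internal node with two child leaves."""
    node.feature, node.threshold, node.label = feature, threshold, None
    node.left, node.right = Node(), Node()
    return node


def set_leaf(node: Node, label: int) -> Node:
    """Tool: assign a class label to a leaf."""
    node.label = label
    return node


def predict_one(node: Node, x: np.ndarray) -> int:
    """Route a single sample down the tree to a leaf label."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def evaluate(root: Node, X: np.ndarray, y: np.ndarray) -> float:
    """Tool: report accuracy so the agent can assess its current tree."""
    preds = np.array([predict_one(root, x) for x in X])
    return float((preds == y).mean())


# Example: an agent could issue these tool calls to build a one-split tree,
# inspect the resulting accuracy, and decide whether to refine further.
X = np.array([[0.2], [0.4], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])
root = split_node(Node(), feature=0, threshold=0.5)
set_leaf(root.left, 0)
set_leaf(root.right, 1)
print(evaluate(root, X, y))  # -> 1.0 on this toy dataset
```

Because every tool call is an explicit, loggable operation, the sequence of calls itself forms the human-readable construction trace the abstract describes, and a human can intervene at any step to correct a split or inject domain knowledge.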