Programmatic Representation Learning with Language Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional supervised models (e.g., decision trees) rely on manual feature engineering, while neural networks learn representations automatically at the cost of interpretability and a reliance on specialized hardware. To address this trade-off, the paper proposes LeaPR (Learned Programmatic Representations), a framework that combines large language model (LLM)-driven program synthesis with interpretable decision tree learning. LeaPR uses LLMs to generate semantically meaningful, executable code-based features on demand, and learns predictors with two algorithms: an adaptation of FunSearch that evolves features rather than whole predictors, and a novel ID3 variant that invokes the LLM during node splitting to synthesize task-specific features, enabling end-to-end interpretable representation learning. Evaluated on chess position evaluation, image classification, and text classification tasks, LeaPR predictors are often competitive with neural networks without GPU acceleration, combining high predictive performance with strong interpretability and robust generalization.

📝 Abstract
Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features. Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently. In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors. We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. We propose two algorithms to learn LeaPR models from supervised data. First, we design an adaptation of FunSearch to learn features rather than directly generate predictors. Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes. In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks. Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.
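The hypothesis class described in the abstract can be pictured as plain code: each feature is a function from a raw data point to a scalar, and a decision tree consumes those scalars. A minimal sketch, with hand-written chess-style features standing in for LLM-synthesized ones and a single hand-set stump in place of a learned tree (all names and the `dict` board encoding are hypothetical, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, List

# Hand-written feature functions standing in for LLM-synthesized code.
# In a LeaPR model, each feature maps a raw data point to a scalar.
def material_balance(board: dict) -> float:
    # hypothetical chess feature: white piece count minus black's
    return board["white_pieces"] - board["black_pieces"]

def mobility(board: dict) -> float:
    # hypothetical chess feature: difference in legal move counts
    return board["white_moves"] - board["black_moves"]

FEATURES: List[Callable[[dict], float]] = [material_balance, mobility]

def featurize(point: dict) -> List[float]:
    """Evaluate every code feature on a raw data point."""
    return [f(point) for f in FEATURES]

# A LeaPR model stacks the code features with a decision tree predictor;
# a single hand-set threshold stump keeps the sketch short.
@dataclass
class Stump:
    feature_index: int
    threshold: float
    left_label: int
    right_label: int

    def predict(self, point: dict) -> int:
        x = featurize(point)
        if x[self.feature_index] <= self.threshold:
            return self.left_label
        return self.right_label

model = Stump(feature_index=0, threshold=0.0, left_label=0, right_label=1)
print(model.predict({"white_pieces": 9, "black_pieces": 7,
                     "white_moves": 30, "black_moves": 25}))  # → 1
```

Because both the features and the tree are ordinary code, every prediction can be traced back to a readable test such as "material balance ≤ 0".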
Problem

Research questions and friction points this paper is trying to address.

Learning interpretable programmatic representations using LLMs
Generating feature functions as code for decision trees
Creating neural network-free competitive predictors from data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-synthesized code functions as features
Decision trees with on-demand feature generation
Interpretable end-to-end representation learning paradigm
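The on-demand splitting idea can be sketched in a few lines: when ID3 considers a split, candidate features are requested from a generator (here a stub standing in for the LLM call) and the candidate/threshold pair with the highest information gain is kept. A minimal sketch; `propose_features` and the toy data keys are hypothetical:

```python
import math
from typing import Callable, List, Tuple

def entropy(labels: List[int]) -> float:
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def propose_features(examples) -> List[Callable[[dict], float]]:
    # Stub standing in for the LLM call: in the paper's ID3 variant,
    # the LLM writes new code features conditioned on the examples
    # reaching the node being split.
    return [lambda x: x["a"], lambda x: x["b"]]

def best_split(examples: List[Tuple[dict, int]]):
    """Pick the (gain, feature, threshold) maximizing information gain."""
    labels = [y for _, y in examples]
    base = entropy(labels)
    best = None
    for f in propose_features(examples):
        values = sorted({f(x) for x, _ in examples})
        for t in values[:-1]:  # thresholds strictly below the max value
            left = [y for x, y in examples if f(x) <= t]
            right = [y for x, y in examples if f(x) > t]
            gain = (base
                    - len(left) / len(labels) * entropy(left)
                    - len(right) / len(labels) * entropy(right))
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

examples = [({"a": 0, "b": 1}, 0), ({"a": 1, "b": 2}, 0),
            ({"a": 0, "b": 5}, 1), ({"a": 1, "b": 6}, 1)]
gain, feature, threshold = best_split(examples)
print(gain, threshold)  # → 1.0 2
```

In the full algorithm this split step recurses on the left and right example subsets, so each region of the tree can ask for features tailored to the data it actually sees.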