🤖 AI Summary
Traditional decision trees rely on greedy strategies due to their discrete, non-differentiable nature and combinatorial complexity, hindering global optimization and seamless integration with gradient-based methods. This work proposes a gradient descent-based approach for training hard, axis-aligned decision trees by employing a dense tree representation coupled with straight-through operators, enabling end-to-end joint optimization of all parameters. For the first time, this method achieves fully differentiable decision tree training, overcoming the limitations of greedy construction and natively supporting gradient-dependent frameworks such as multimodal learning and reinforcement learning. Experimental results demonstrate state-of-the-art performance in interpretable tabular data modeling, complex tabular tasks, and interpretable reinforcement learning without information loss.
📝 Abstract
Tree-based models are widely recognized for their interpretability and have proven effective across application domains, particularly high-stakes ones. However, learning decision trees (DTs) poses a significant challenge due to their combinatorial complexity and discrete, non-differentiable nature. As a result, traditional methods such as CART, which rely on greedy search procedures, remain the most widely used approaches. These methods make locally optimal decisions at each node, constraining the search space and often leading to suboptimal tree structures. Additionally, their reliance on custom training procedures precludes seamless integration into modern machine learning (ML) approaches.
In this thesis, we propose a novel method for learning hard, axis-aligned DTs through gradient descent. Our approach applies backpropagation with a straight-through operator to a dense DT representation, enabling the joint optimization of all tree parameters and thereby addressing the two primary limitations of traditional DT algorithms. First, gradient-based training is not constrained by the sequential selection of locally optimal splits but instead jointly optimizes all tree parameters. Second, by leveraging gradient descent for optimization, our approach integrates seamlessly into existing ML approaches, e.g., for multimodal and reinforcement learning tasks, which inherently rely on gradient descent.
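To make the straight-through idea concrete, here is a minimal NumPy sketch, not the thesis implementation: the forward pass evaluates a hard, axis-aligned split, while the backward pass propagates gradients as if a sigmoid relaxation had been used. The function names and the `temperature` parameter are illustrative assumptions.

```python
import numpy as np

def hard_split_forward(x, threshold):
    # Forward pass: hard, axis-aligned split on a single feature.
    # Returns 1.0 where the feature exceeds the threshold, else 0.0.
    return (x > threshold).astype(float)

def straight_through_grad(x, threshold, upstream_grad, temperature=1.0):
    # Backward pass of the straight-through estimator: gradients are
    # computed as if the forward pass had used a sigmoid relaxation
    # sigma((x - threshold) / temperature) instead of the hard step.
    s = 1.0 / (1.0 + np.exp(-(x - threshold) / temperature))
    d_soft = s * (1.0 - s) / temperature  # derivative of the sigmoid
    grad_x = upstream_grad * d_soft        # gradient w.r.t. the input
    grad_threshold = upstream_grad * (-d_soft)  # gradient w.r.t. the split threshold
    return grad_x, grad_threshold
```

Because the backward pass yields non-zero gradients for the split threshold, all such thresholds in a dense tree can be updated jointly by gradient descent, rather than being fixed one node at a time as in greedy construction.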
These advancements allow us to achieve state-of-the-art results across multiple domains, including interpretable DTs for small tabular datasets, advanced models for complex tabular data, multimodal learning, and interpretable reinforcement learning without information loss. By bridging the gap between DTs and gradient-based optimization, our method significantly enhances the performance and applicability of tree-based models across various ML domains.