🤖 AI Summary
This work addresses a scalability limitation of traditional tree ensemble similarity methods: computing sample similarity through explicit pairwise comparisons incurs quadratic time and memory cost. The authors propose a separable, weighted leaf-collision similarity framework that formalizes a class of similarity measures admitting an exact sparse matrix factorization. Because computation is restricted to leaf-level collisions and exploits the inherent sparsity of tree-based representations, explicit pairwise evaluation is avoided entirely. By combining decision tree structure with sparse linear algebra, the approach enables efficient proximity and nearest-neighbor computation on standard CPUs, scaling to hundreds of thousands of samples while substantially reducing both runtime and memory consumption.
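The core idea can be sketched in a few lines. Below is a minimal illustration, not the authors' implementation: it builds a sparse one-hot leaf-indicator matrix `Z` from a scikit-learn random forest (via `apply()`) and recovers the classic unweighted proximity, i.e. the fraction of trees in which two samples fall into the same leaf, as a single sparse matrix product rather than a pairwise loop. The paper's framework generalizes this to weighted variants.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

leaves = rf.apply(X)                     # (n_samples, n_trees) leaf indices
n_samples, n_trees = leaves.shape

# Offset each tree's leaf ids into a disjoint global column range,
# so every (tree, leaf) pair becomes a unique column of Z.
offsets = np.concatenate(([0], np.cumsum(leaves.max(axis=0) + 1)[:-1]))
cols = (leaves + offsets).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)
Z = sparse.csr_matrix(
    (np.ones(n_samples * n_trees), (rows, cols)),
    shape=(n_samples, int(cols.max()) + 1),
)

# Exact factorized proximity: P[i, j] is the fraction of trees
# in which samples i and j share a leaf. No pairwise loop needed.
P = (Z @ Z.T) / n_trees
```

Each sample trivially shares its own leaf in every tree, so the diagonal of `P` is exactly 1; sparsity of `Z` keeps the product cheap when leaves are small relative to the dataset.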
📝 Abstract
Tree ensemble methods such as Random Forests naturally induce supervised similarity measures through their decision tree structure, but existing implementations of proximities derived from tree ensembles typically suffer from quadratic time or memory complexity, limiting their scalability. In this work, we introduce a general framework for efficient proximity computation by defining a family of Separable Weighted Leaf-Collision Proximities. We show that any proximity measure in this family admits an exact sparse matrix factorization, restricting computation to leaf-level collisions and avoiding explicit pairwise comparisons. This formulation enables low-memory, scalable proximity computation using sparse linear algebra in Python. Empirical benchmarks demonstrate substantial runtime and memory improvements over traditional approaches, allowing tree ensemble proximities to scale efficiently to datasets with hundreds of thousands of samples on standard CPU hardware.
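The low-memory nearest-neighbor claim follows directly from the factorization: proximities to a single query sample are one sparse row-times-matrix product, so the dense n-by-n proximity matrix never needs to be materialized. A hedged sketch under the same one-hot construction as above (the `knn` helper and its parameters are illustrative, not from the paper):

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

leaves = rf.apply(X)                     # (n_samples, n_trees)
n, T = leaves.shape
offsets = np.concatenate(([0], np.cumsum(leaves.max(axis=0) + 1)[:-1]))
Z = sparse.csr_matrix(
    (np.ones(n * T), (np.repeat(np.arange(n), T), (leaves + offsets).ravel()))
)

def knn(i, k):
    """k nearest neighbors of sample i under forest proximity.

    One sparse row product per query: O(nnz) work and O(n) memory,
    instead of storing the full n-by-n proximity matrix.
    """
    prox = (Z[i] @ Z.T).toarray().ravel() / T
    prox[i] = -1.0                       # exclude the query itself
    return np.argsort(prox)[::-1][:k]

nbrs = knn(0, k=5)
```

Batching queries (a block of rows of `Z` at a time) trades a small amount of memory for fewer sparse products, which is how such a scheme would scale to hundreds of thousands of samples on a CPU.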