🤖 AI Summary
For right-censored survival data, the log-rank splitting criterion used in random survival forests is expensive to evaluate: each candidate split requires an O(M) recomputation, where M is the number of distinct event times, which hinders scalability to large cohorts. To address this, we propose a constant-time incremental update for the log-rank statistic. Building on LeBlanc and Crowley’s (1995) approximation, our approach eliminates redundant full recomputations at candidate splits while preserving predictive performance. We implement the optimization within the generalized random forests (grf) framework and empirically validate its scalability on large survival datasets: training speed improves by several-fold to over an order of magnitude, depending on cohort size, without compromising survival prediction accuracy or statistical consistency. This work provides an efficient and robust foundation for tree-based ensemble modeling in high-dimensional, large-scale survival analysis.
📝 Abstract
Random survival forests are widely used for estimating covariate-conditional survival functions under right-censoring. Their standard log-rank splitting criterion is typically recomputed at each candidate split. This O(M) cost per split, with M the number of distinct event times in a node, creates a bottleneck for large cohort datasets with long follow-up. We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion. The method is implemented in grf and substantially reduces training time on large datasets while preserving predictive performance.
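The constant-time update can be illustrated with a simplified scheme in the spirit of the LeBlanc and Crowley approximation: if the log-rank criterion is approximated by per-observation scores (Nelson-Aalen martingale residuals, computed once per node), the split criterion reduces to a running sum, so scanning thresholds needs only an O(1) update per candidate instead of an O(M) recomputation. This is a minimal illustrative sketch, not the grf implementation; the function names and the score-based test statistic are assumptions.

```python
def nelson_aalen_scores(times, events):
    """Per-observation log-rank scores a_i = delta_i - Lambda_hat(t_i),
    where Lambda_hat is the Nelson-Aalen cumulative hazard estimate.
    Computed once per node in O(n log n)."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    at_risk, cum_haz = n, 0.0
    scores = [0.0] * n
    j = 0
    while j < n:
        # Group tied event times and count events d at this time.
        k, d = j, 0
        while k < n and times[order[k]] == times[order[j]]:
            d += events[order[k]]
            k += 1
        if d:
            cum_haz += d / at_risk  # Nelson-Aalen increment
        for m in range(j, k):
            i = order[m]
            scores[i] = events[i] - cum_haz
        at_risk -= (k - j)
        j = k
    return scores


def best_split(x, times, events):
    """Scan candidate thresholds on covariate x. Each step moves one
    sample into the left child and updates the score sum in O(1),
    rather than recomputing the criterion over all M event times."""
    n = len(x)
    scores = nelson_aalen_scores(times, events)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    order = sorted(range(n), key=lambda i: x[i])
    s_left, best_stat, best_thr = 0.0, -1.0, None
    for pos in range(n - 1):
        i = order[pos]
        s_left += scores[i]            # the O(1) incremental update
        if x[order[pos + 1]] == x[i]:  # cannot split between tied x values
            continue
        n_left = pos + 1
        n_right = n - n_left
        denom = n_left * n_right * var / n
        if denom <= 0:
            continue
        # Standardized two-sample statistic on the scores.
        stat = (s_left - n_left * mean) ** 2 / denom
        if stat > best_stat:
            best_stat = stat
            best_thr = (x[i] + x[order[pos + 1]]) / 2
    return best_thr, best_stat
```

Because the node-level scores are fixed while thresholds are scanned, the per-candidate cost no longer depends on the number of distinct event times; only the one-time score computation does.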