🤖 AI Summary
To address the lack of hardware acceleration for tree-based models (e.g., XGBoost, CatBoost) on tabular data, this paper proposes an analog-digital hybrid inference architecture tailored for scientific discovery. Methodologically, it integrates three key innovations: (1) co-design of an increased-precision analog content-addressable memory (CAM) with a programmable chip; (2) a hardware-aware training methodology that jointly optimizes gradient-boosted tree models and analog circuit characteristics; and (3) synergistic optimization of tree-model compilation and in-memory computing circuit design. Experimental results demonstrate that, compared with a state-of-the-art GPU, the architecture achieves 119× higher throughput, 9,740× lower latency, and over 150× better energy efficiency, with a peak power consumption of only 19 W. This work significantly advances efficient inference on structured data.
📝 Abstract
Structured, or tabular, data are the most common format in data science. While deep learning models have proven formidable in learning from unstructured data such as images or speech, they are less accurate than simpler approaches when learning from tabular data. In contrast, modern tree-based machine learning (ML) models shine in extracting relevant information from structured data. An essential requirement in data science is to reduce model inference latency in cases where, for example, models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware acceleration community has mostly focused on deep neural networks and largely ignored other forms of ML. Previous work has described the use of an analog content addressable memory (CAM) component for efficiently mapping random forests (RFs). In this work, we develop an analog-digital architecture that implements a novel increased precision analog CAM and a programmable chip for inference of state-of-the-art tree-based ML models, such as eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and others. Thanks to hardware-aware training, X-TIME reaches state-of-the-art accuracy and <inline-formula> <tex-math notation="LaTeX">$119\times $ </tex-math></inline-formula> higher throughput at <inline-formula> <tex-math notation="LaTeX">$9740\times $ </tex-math></inline-formula> lower latency with <inline-formula> <tex-math notation="LaTeX">${&gt;}150\times $ </tex-math></inline-formula> improved energy efficiency compared with a state-of-the-art GPU for models with up to 4096 trees and depth of 8, with a 19-W peak power consumption.
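The abstract's core idea of mapping tree models onto an analog CAM can be illustrated in software: each root-to-leaf path of a decision tree collapses into one CAM row holding a [low, high) interval per feature, and inference becomes a (in hardware, fully parallel) row match. The sketch below is illustrative only; the tree, function names, and interval convention are assumptions for exposition, not the paper's actual compiler or circuit model.

```python
# Hedged sketch: flattening a decision tree into CAM-style range rows.
# Each row pairs one interval per feature with the leaf it selects;
# lookup finds the row whose intervals all contain the input vector.
import math

# Toy tree over two features: (feature_index, threshold, left, right);
# non-tuple nodes are leaves. Left branch means feature < threshold.
TREE = (0, 5.0,
        (1, 2.0, "A", "B"),   # taken when x0 <  5.0
        (1, 7.0, "C", "D"))   # taken when x0 >= 5.0

def tree_to_cam_rows(node, bounds=None, n_features=2):
    """Flatten every root-to-leaf path into (intervals, leaf) rows."""
    if bounds is None:
        bounds = [(-math.inf, math.inf)] * n_features
    if not isinstance(node, tuple):            # leaf reached
        return [(list(bounds), node)]
    feat, thr, left, right = node
    lo, hi = bounds[feat]
    rows = []
    lb = list(bounds); lb[feat] = (lo, min(hi, thr))   # x[feat] < thr
    rows += tree_to_cam_rows(left, lb, n_features)
    rb = list(bounds); rb[feat] = (max(lo, thr), hi)   # x[feat] >= thr
    rows += tree_to_cam_rows(right, rb, n_features)
    return rows

def cam_lookup(rows, x):
    """Software stand-in for the CAM's parallel range match."""
    for intervals, leaf in rows:
        if all(lo <= v < hi for v, (lo, hi) in zip(x, intervals)):
            return leaf
    return None

rows = tree_to_cam_rows(TREE)
print(cam_lookup(rows, [3.0, 1.0]))  # x0<5 and x1<2  -> "A"
print(cam_lookup(rows, [6.0, 9.0]))  # x0>=5 and x1>=7 -> "D"
```

A gradient-boosted ensemble such as XGBoost would simply contribute one such row set per tree, with the matched leaf values summed digitally; the analog CAM's advantage is that all rows are compared in a single step rather than one comparison per tree node.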