RFOD: Random Forest-based Outlier Detection for Tabular Data

📅 2025-10-09
🤖 AI Summary
To address semantic loss, poor interpretability, and modeling difficulty in anomaly detection for mixed-type tabular data, this paper proposes RFOD, a Random Forest-based Outlier Detection framework. RFOD recasts anomaly detection as a feature-level conditional reconstruction task: a dedicated random forest models each feature conditioned on the others, so no global joint distribution has to be modeled. It introduces two components: Adjusted Gower's Distance (AGD), which scores each cell and adapts to skewed numerical data and categorical confidence, and Uncertainty-Weighted Averaging (UWA), which aggregates cell-level scores into row-level anomaly scores. Together they give fine-grained, interpretable, cell-wise anomaly scoring while preserving the semantics of categorical features. On 15 real-world datasets, RFOD is reported to outperform state-of-the-art baselines in detection accuracy while offering better robustness, scalability, and interpretability.
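The feature-level conditional reconstruction idea can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and that categorical columns are integer-coded, which is a convenience of this sketch and not something the paper prescribes. The function names `fit_feature_forests` and `reconstruct`, the toy data, and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fit_feature_forests(X_train, categorical, n_trees=50, seed=0):
    """Train one forest per column, each predicting that column from the others."""
    forests = []
    for j in range(X_train.shape[1]):
        others = np.delete(X_train, j, axis=1)  # all columns except j
        if j in categorical:
            # Assumption: categorical columns hold integer codes.
            model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
            model.fit(others, X_train[:, j].astype(int))
        else:
            model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            model.fit(others, X_train[:, j])
        forests.append(model)
    return forests


def reconstruct(forests, X):
    """Predict every cell from the remaining cells in the same row."""
    X_hat = np.empty(X.shape, dtype=float)
    for j, model in enumerate(forests):
        X_hat[:, j] = model.predict(np.delete(X, j, axis=1))
    return X_hat


# Toy data: column 1 depends on column 0; column 2 is a binary category.
rng = np.random.default_rng(0)
x0 = rng.normal(size=300)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=300)
cat = (x0 > 0).astype(float)
X_train = np.column_stack([x0, x1, cat])

forests = fit_feature_forests(X_train, categorical={2})

# The second row breaks the learned relationship between columns 0 and 1.
X_test = np.array([[1.0, 2.0, 1.0], [1.0, -5.0, 1.0]])
X_hat = reconstruct(forests, X_test)

# Raw absolute differences; a real cell score would treat categories differently.
errors = np.abs(X_test - X_hat)
```

In this toy setup the reconstruction error for column 1 is much larger in the second test row than in the first, which is the kind of cell-level signal the summary describes.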

📝 Abstract
Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
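The abstract does not give the AGD formula, so the following is only a Gower-style sketch under stated assumptions. Classic Gower distance normalises numeric differences by the feature's min-max range and scores categorical mismatches as 0 or 1. Here, as assumptions, the numeric range is replaced by a quantile range (to be less sensitive to skew and extreme values) and the categorical score is one minus the predicted probability of the observed category (to reflect confidence). Both function names and the cap at 1.0 are hypothetical.

```python
def numeric_cell_score(observed, predicted, q_low, q_high):
    """Absolute error scaled by a quantile range (assumed), capped at 1.0."""
    spread = q_high - q_low
    if spread <= 0:
        # Degenerate feature: fall back to an exact-match check.
        return 0.0 if observed == predicted else 1.0
    return min(abs(observed - predicted) / spread, 1.0)


def categorical_cell_score(observed, class_probs):
    """One minus the predicted probability of the observed category (assumed)."""
    return 1.0 - class_probs.get(observed, 0.0)


# A numeric cell that is 20 units off on a feature whose quantile range is 100.
amount_score = numeric_cell_score(120.0, 100.0, q_low=50.0, q_high=150.0)

# A categorical cell whose observed value the model considered very likely.
protocol_score = categorical_cell_score("tcp", {"tcp": 0.9, "udp": 0.1})

# A category the model never predicted gets the maximum score.
unseen_score = categorical_cell_score("icmp", {"tcp": 0.9, "udp": 0.1})
```

Both scores fall in [0, 1], so numeric and categorical cells can be compared and later combined on a common scale.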
Problem

Research questions and friction points this paper is trying to address.

Detects outliers in mixed-type tabular data
Handles heterogeneous data without losing semantic information
Provides interpretable anomaly detection with cell-level insights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Random Forest-based outlier detection for tabular data
Feature-wise conditional reconstruction using dedicated forests
Cell-level scoring with Adjusted Gower's Distance and Uncertainty-Weighted Averaging
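The aggregation step in the last item can be sketched as a weighted mean in which cells with less certain predictions count for less. The page does not state the weighting RFOD uses, so the weight `1 / (1 + u)` and the function name `uncertainty_weighted_average` are assumptions made purely for illustration. The uncertainty `u` could be, for example, the spread of predictions across the trees of a forest.

```python
def uncertainty_weighted_average(cell_scores, uncertainties):
    """Combine cell scores into one row score, down-weighting uncertain cells."""
    if not cell_scores or len(cell_scores) != len(uncertainties):
        raise ValueError("need one non-negative uncertainty per cell score")
    # Assumed weighting: zero uncertainty gives weight 1, larger values give less.
    weights = [1.0 / (1.0 + u) for u in uncertainties]
    weighted_sum = sum(w * s for w, s in zip(weights, cell_scores))
    return weighted_sum / sum(weights)


# The middle cell has a high score, but its prediction was uncertain.
scores = [0.1, 0.9, 0.2]
uncertainties = [0.0, 1.0, 0.0]

row_score = uncertainty_weighted_average(scores, uncertainties)
plain_mean = sum(scores) / len(scores)
```

With these inputs the weighted row score is lower than the plain mean, because the one large cell score comes from a prediction the model was unsure about.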
Yihao Ang
National University of Singapore
Peicheng Yao
National University of Singapore
Yifan Bao
National University of Singapore
Yushuo Feng
Huazhong University of Science & Technology
Qiang Huang
Harbin Institute of Technology (Shenzhen)
Anthony K. H. Tung
National University of Singapore
Zhiyong Huang
Associate Professor, Department of Computer Science, School of Computing, NUS
Machine Learning, Computer Graphics, Computer Vision, Multimedia, Databases