CaliciBoost: Performance-Driven Evaluation of Molecular Representations for Caco-2 Permeability Prediction

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Accurate prediction of oral absorption in early drug discovery remains challenging because existing models offer insufficient predictive performance. Method: This study systematically evaluates eight classes of molecular representations, including 2D/3D descriptors, ECFP/MACCS fingerprints, and GNN embeddings, for Caco-2 permeability prediction. Leveraging AutoML frameworks (TPOT and H2O), we propose CaliciBoost, an AutoML-based ensemble model optimized for small-sample ADMET modeling. Contribution/Results: For the PaDEL and Mordred representations, adding 3D descriptors to 2D features reduces mean absolute error (MAE) by 15.73%. CaliciBoost achieves the lowest MAE on both the TDC benchmark and the curated OCHEM dataset. Furthermore, we establish a reusable, principled framework for molecular representation selection tailored to low-data ADMET tasks. This work provides both methodological guidance and open-source tools for high-confidence oral absorption prediction.

📝 Abstract
Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated how eight types of molecular feature representation, including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings, perform when combined with automated machine learning (AutoML) techniques for Caco-2 permeability prediction. Using two datasets of differing scale and diversity (the TDC benchmark and curated OCHEM data), we assessed model performance across representations and identified PaDEL, Mordred, and RDKit descriptors as particularly effective for Caco-2 prediction. Notably, the AutoML-based model CaliciBoost achieved the best MAE performance. Furthermore, for both PaDEL and Mordred representations, the incorporation of 3D descriptors resulted in a 15.73% reduction in MAE compared to using 2D features alone, as confirmed by feature importance analysis. These findings highlight the effectiveness of AutoML approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.
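
The 15.73% figure in the abstract is a relative reduction in MAE between two feature sets. As a minimal sketch of how such a number is typically computed, the snippet below defines MAE and the percentage reduction against a baseline. The function names and all numeric values are hypothetical toy data chosen for illustration; they are not taken from the paper and do not reproduce its results.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def relative_mae_reduction(mae_baseline, mae_new):
    """Percentage by which mae_new is lower than mae_baseline."""
    if mae_baseline <= 0:
        raise ValueError("mae_baseline must be positive")
    return 100.0 * (mae_baseline - mae_new) / mae_baseline


# Toy values on an arbitrary log scale (assumption: made up, NOT paper data).
y_true = [-5.0, -4.5, -6.0, -5.5]
pred_2d = [-5.4, -4.9, -5.6, -5.1]     # hypothetical model using 2D features only
pred_2d_3d = [-5.3, -4.8, -5.7, -5.2]  # hypothetical model using 2D + 3D features

mae_2d = mean_absolute_error(y_true, pred_2d)
mae_2d_3d = mean_absolute_error(y_true, pred_2d_3d)
reduction = relative_mae_reduction(mae_2d, mae_2d_3d)

print(f"MAE (2D only): {mae_2d:.3f}")
print(f"MAE (2D + 3D): {mae_2d_3d:.3f}")
print(f"Relative MAE reduction: {reduction:.2f}%")
```

The reduction is expressed relative to the 2D-only baseline, so the same absolute improvement yields a larger percentage when the baseline error is small.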
Problem

Research questions and friction points this paper is trying to address.

Evaluating molecular representations for Caco-2 permeability prediction
Comparing 2D/3D descriptors and AutoML techniques for accuracy
Identifying optimal feature sets to reduce prediction error
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combined 2D/3D descriptors with AutoML
Used PaDEL, Mordred, RDKit for prediction
Achieved a 15.73% MAE reduction by adding 3D descriptors to 2D features
Authors

Huong Van Le
Calici Co., Ltd
Weibin Ren
Calici Co., Ltd
Junhong Kim
Calici Co., Ltd
Y. Yun
Calici Co., Ltd
Young Bin Park
Calici Co., Ltd
Young Jun Kim
Korea University
Bok Kyung Han
Korea University, InsightFI Co., Ltd
Jong IL Park
Chungnam National University
Hwi-Yeol Yun
Chungnam National University
Jae-Mun Choi
Calici Co., Ltd
AI Drug Discovery & Development