Integrated Multivariate Segmentation Tree for the Analysis of Heterogeneous Credit Data in Small and Medium-Sized Enterprises

📅 2025-08-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

SME credit assessment faces two obstacles: integrating high-dimensional heterogeneous data (numerical financial features and unstructured text), and the inability of conventional decision trees to exploit textual information. To address them, this paper proposes the Integrated Multivariate Segmentation Tree (IMST). IMST combines a matrix-factorization-based text representation, Lasso-driven sparse feature selection, and a multivariate decision tree structure. Splits are chosen using the Gini index or entropy, and weakest-link pruning controls tree complexity, which keeps the model interpretable and structurally simple while fusing the two data modalities. Evaluated on a real-world dataset of 1,428 Chinese SMEs, IMST achieves an accuracy of 88.9%, compared with 87.4% for a baseline decision tree, and also outperforms logistic regression and SVM. The authors further report stronger risk detection and better computational efficiency, positioning IMST as an interpretable approach to credit scoring with mixed data modalities.

📝 Abstract

Traditional decision tree models, which rely exclusively on numerical variables, often encounter difficulties in handling high-dimensional data and fail to effectively incorporate textual information. To address these limitations, we propose the Integrated Multivariate Segmentation Tree (IMST), a comprehensive framework designed to enhance credit evaluation for small and medium-sized enterprises (SMEs) by integrating financial data with textual sources. The methodology comprises three core stages: (1) transforming textual data into numerical matrices through matrix factorization; (2) selecting salient financial features using Lasso regression; and (3) constructing a multivariate segmentation tree based on the Gini index or Entropy, with weakest-link pruning applied to regulate model complexity. Experimental results derived from a dataset of 1,428 Chinese SMEs demonstrate that IMST achieves an accuracy of 88.9%, surpassing baseline decision trees (87.4%) as well as conventional models such as logistic regression and support vector machines (SVM). Furthermore, the proposed model exhibits superior interpretability and computational efficiency, featuring a more streamlined architecture and enhanced risk detection capabilities.
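The third stage scores candidate splits with the Gini index or entropy. The sketch below illustrates that idea for a multivariate split, assumed here to be a linear combination of features compared against a threshold. The abstract does not specify the form of the split, so the function names (`gini`, `entropy`, `split_gain`), the weight vector, and the toy data are all hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(rows, labels, weights, threshold, impurity=gini):
    """Impurity decrease for the (assumed) multivariate split w . x <= threshold."""
    left, right = [], []
    for x, y in zip(rows, labels):
        score = sum(w * v for w, v in zip(weights, x))
        (left if score <= threshold else right).append(y)
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


# Toy data (hypothetical): two features per firm, label 1 = default.
rows = [(1.0, 0.0), (2.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
labels = [0, 0, 1, 1]
weights = (0.5, 0.5)

gini_gain = split_gain(rows, labels, weights, threshold=2.0)                       # 0.5
entropy_gain = split_gain(rows, labels, weights, threshold=2.0, impurity=entropy)  # 1.0
```

Because the split uses a weighted sum of several features instead of a single one, a single node can separate classes that an axis-aligned split would need several levels to separate, which is one plausible reason for the more streamlined tree reported in the abstract.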

Problem

Research questions and friction points this paper is trying to address.

SME credit evaluation needs to combine financial data with textual sources

Traditional decision trees struggle with high-dimensional data and cannot use textual information

Heterogeneous credit analysis requires both accuracy and interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

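Converts textual data into numerical matrices via matrix factorization so it can be combined with financial data

Selects key financial features using Lasso regression

Builds a multivariate segmentation tree (Gini index or entropy) and applies weakest-link pruning to regulate complexity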
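Weakest-link pruning, as used in standard CART cost-complexity pruning, repeatedly collapses the internal node whose removal costs the least training error per leaf removed. The sketch below shows one pruning step under that textbook definition. The paper's exact procedure may differ, and the `Node` class, its `errors` field, and the toy tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    errors: int                      # training samples misclassified if this node were a leaf
    left: Optional["Node"] = None    # a binary tree is assumed: both children or neither
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None


def leaves(node):
    """All leaf nodes of the subtree rooted at node."""
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)


def internal_nodes(node):
    """All non-leaf nodes of the subtree rooted at node."""
    if node.is_leaf():
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)


def link_strength(node, n_total):
    """g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1), with R measured as error rate."""
    sub = leaves(node)
    r_node = node.errors / n_total
    r_subtree = sum(leaf.errors for leaf in sub) / n_total
    return (r_node - r_subtree) / (len(sub) - 1)


def prune_weakest_link(root, n_total):
    """Collapse the internal node with the smallest g(t) and return that g(t)."""
    weakest = min(internal_nodes(root), key=lambda t: link_strength(t, n_total))
    alpha = link_strength(weakest, n_total)
    weakest.left = weakest.right = None  # the node becomes a leaf
    return alpha


# Toy tree (hypothetical) fitted on 100 samples.
inner = Node(errors=10, left=Node(errors=4), right=Node(errors=5))
root = Node(errors=40, left=inner, right=Node(errors=20))
alpha = prune_weakest_link(root, n_total=100)  # collapses `inner`, the weakest link
```

Repeating this step yields a nested sequence of progressively smaller trees; the final tree is typically chosen from that sequence by validation error. `prune_weakest_link` raises `ValueError` if the root is already a leaf.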
