🤖 AI Summary
Addressing the challenges of scarce sample size and severe class imbalance in tabular data for financial credit scoring, this paper proposes an adaptive data distillation framework tailored for pre-trained large models such as TabPFN. The method innovatively integrates class-imbalance awareness into the distillation process—specifically, via a class-weighted distillation loss and gradient-matching-driven adaptive sampling—to jointly optimize the distilled data distribution and alignment with downstream task objectives. Empirically, the approach improves AUC by 2.5 percentage points on real-world credit datasets and substantially enhances TabPFN’s generalization and deployment scalability under extremely limited supervision (e.g., only hundreds of labeled samples). It establishes a transferable, lightweight adaptation paradigm for few-shot, imbalanced tabular learning—bridging the gap between large pre-trained models and practical, resource-constrained financial applications.
📝 Abstract
The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-based models due to their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application in the financial domain predominantly revolves around question-answering tasks, and their use on tabular-structured credit scoring datasets remains largely unexplored. Tabular-oriented large models such as TabPFN have made the application of large models in credit scoring feasible, albeit only for limited sample sizes. This paper proposes a novel framework that combines a tabular-tailored dataset distillation technique with the pretrained model, improving the scalability of TabPFN. Furthermore, although class imbalance is a common characteristic of financial datasets, its influence on dataset distillation has not been explored. We therefore integrate imbalance-aware techniques into the dataset distillation process, yielding improved performance on financial datasets (e.g., a 2.5% improvement in AUC). This study presents a novel framework for scaling up the application of large pretrained models to financial tabular datasets and offers a comparative analysis of the influence of class imbalance on the dataset distillation process. We believe this approach can broaden the applications and downstream tasks of large models in the financial domain.
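To make the imbalance-aware distillation idea concrete, here is a minimal sketch of a class-weighted gradient-matching objective of the kind the summary describes. This is not the paper's implementation: the function names (`class_weights`, `weighted_gradient_match`) and the inverse-frequency weighting scheme are illustrative assumptions; the paper's actual loss and sampling strategy may differ.

```python
import numpy as np

def class_weights(labels, n_classes):
    # Illustrative inverse-frequency weights: minority classes (e.g. defaults
    # in a credit dataset) receive proportionally larger weight.
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def weighted_gradient_match(real_grads, syn_grads, weights):
    # real_grads / syn_grads: per-class gradient vectors computed on a real
    # batch and on the distilled (synthetic) batch, respectively.
    # The loss is the class-weighted squared distance between them, so the
    # distilled data is pushed to reproduce minority-class gradients as well.
    loss = 0.0
    for c, w in enumerate(weights):
        loss += w * np.sum((real_grads[c] - syn_grads[c]) ** 2)
    return loss
```

In a full pipeline, the distilled tabular samples would be updated by descending this loss, and the resulting small dataset passed to TabPFN, which is designed for few-shot tabular prediction.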