🤖 AI Summary
Addressing the challenges of scarce sample size and severe class imbalance in tabular data for financial credit scoring, this paper proposes an adaptive data distillation framework tailored for pre-trained large models such as TabPFN. The method innovatively integrates class-imbalance awareness into the distillation process—specifically, via a class-weighted distillation loss and gradient-matching-driven adaptive sampling—to jointly optimize the distilled data distribution and alignment with downstream task objectives. Empirically, the approach improves AUC by 2.5 percentage points on real-world credit datasets and substantially enhances TabPFN’s generalization and deployment scalability under extremely limited supervision (e.g., only hundreds of labeled samples). It establishes a transferable, lightweight adaptation paradigm for few-shot, imbalanced tabular learning—bridging the gap between large pre-trained models and practical, resource-constrained financial applications.
📝 Abstract
The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-based models due to their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application in the financial domain predominantly revolves around question-answering tasks, and their use on tabular-structured credit scoring datasets remains largely unexplored. Tabular-oriented large models such as TabPFN have made the application of large models in credit scoring feasible, albeit only for limited sample sizes. This paper proposes a novel framework that combines a tabular-tailored dataset distillation technique with the pretrained model, improving the scalability of TabPFN. Furthermore, although class imbalance is a common characteristic of financial datasets, its influence on dataset distillation has not been explored. We therefore integrate imbalance-aware techniques into the dataset distillation process, yielding improved performance on financial datasets (e.g., a 2.5% improvement in AUC). This study presents a novel framework for scaling up the application of large pretrained models to financial tabular datasets and offers a comparative analysis of the influence of class imbalance on the dataset distillation process. We believe this approach can broaden the applications and downstream tasks of large models in the financial domain.
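To make the imbalance-aware distillation idea concrete, here is a minimal sketch of a class-weighted gradient-matching objective of the kind the summary describes. This is not the paper's implementation: the function names (`class_weights`, `weighted_gradient_match`) and the inverse-frequency weighting scheme are illustrative assumptions; the paper's actual loss and sampling strategy may differ.

```python
import numpy as np

def class_weights(labels, n_classes):
    # Illustrative inverse-frequency weights: minority classes (e.g. defaults
    # in a credit dataset) receive proportionally larger weight.
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def weighted_gradient_match(real_grads, syn_grads, weights):
    # real_grads / syn_grads: per-class gradient vectors computed on a real
    # batch and on the distilled (synthetic) batch, respectively.
    # The loss is the class-weighted squared distance between them, so the
    # distilled data is pushed to reproduce minority-class gradients as well.
    loss = 0.0
    for c, w in enumerate(weights):
        loss += w * np.sum((real_grads[c] - syn_grads[c]) ** 2)
    return loss
```

In a full pipeline, the distilled tabular samples would be updated by descending this loss, and the resulting small dataset passed to TabPFN, which is designed for few-shot tabular prediction.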