TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts

📅 2025-10-07
🤖 AI Summary
Tabular foundation models such as TabPFN struggle with high-dimensional biomedical data, where there are thousands of potentially noisy features but only a few samples. They are not suited to feature counts above roughly 500, and the feature reduction needed to apply them hinders feature importance analysis. Method: The authors extend an existing model through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, scales beyond 50,000 features regardless of noise level while keeping its inherent interpretability. Results: TabPFN-Wide matches or exceeds the predictive performance of its base model and is more robust to noise. On real-world biomedical datasets, many of the most relevant features it identifies overlap with previous biological findings, and others suggest starting points for future studies.

📝 Abstract
Revealing novel insights from the relationship between molecular measurements and pathology remains a highly impactful application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional machine learning approaches. While prior-data fitted networks have emerged as foundation models for tabular data, they are currently not suited to handle large feature counts (>500). Although feature reduction enables their application, it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance while exhibiting improved robustness to noise. It scales beyond 50,000 features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results show that prior-informed adaptation is suitable for enhancing the capability of foundation models on high-dimensional data. On real-world biomedical datasets, many of the most relevant features identified by the model overlap with previous biological findings, while others suggest potential starting points for future studies.
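The abstract's point that feature reduction hinders feature importance analysis can be illustrated with a small sketch. It uses plain SVD-based PCA as a stand-in for whatever reduction one might apply (an assumption; the abstract does not name a specific method), and the helper `pca_reduce` and the random data are hypothetical. Each reduced component is a dense mix of almost all original features, so an importance score assigned to a component does not point back to individual measurements.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (plain SVD-based PCA)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions in the original feature space.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]  # shape: (n_components, n_features)
    return Xc @ loadings.T, loadings

rng = np.random.default_rng(0)
n_samples, n_features = 40, 5000  # few observations, many features (illustrative sizes)
X = rng.normal(size=(n_samples, n_features))

Z, loadings = pca_reduce(X, n_components=10)

# Fraction of original features with a non-negligible weight in the first component.
dense_fraction = np.mean(np.abs(loadings[0]) > 1e-6)
print(Z.shape, loadings.shape, dense_fraction)
```

A model trained on `Z` sees only 10 inputs, which fits within a 500-feature limit, but any importance it reports refers to components rather than to the 5,000 original columns.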
Problem

Research questions and friction points this paper is trying to address.

Handling high-dimensional biomedical data with thousands of features
Overcoming limitations of foundation models for large feature counts
Maintaining interpretability while scaling to over 50,000 noisy features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended pre-training on synthetic data
Scaled to over 50,000 features
Maintained interpretability with noise robustness
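As a rough illustration of what "synthetic data with many noisy features" could look like, the sketch below samples one wide classification task with few samples. This is a hypothetical generator written for this page, not the customized prior described in the paper; the function name `sample_wide_task` and all of its parameters are assumptions.

```python
import numpy as np

def sample_wide_task(n_samples=64, n_informative=5, n_features=2000,
                     noise_std=1.0, seed=0):
    """Sample one synthetic wide task (hypothetical, not the paper's prior)."""
    rng = np.random.default_rng(seed)
    # A handful of latent signals determine the binary label.
    latent = rng.normal(size=(n_samples, n_informative))
    weights = rng.normal(size=n_informative)
    y = (latent @ weights > 0).astype(int)
    # Each observed feature copies one latent signal at a random strength
    # and adds independent Gaussian noise on top.
    source = rng.integers(0, n_informative, size=n_features)
    scale = rng.uniform(0.0, 1.0, size=n_features)
    noise = rng.normal(scale=noise_std, size=(n_samples, n_features))
    X = latent[:, source] * scale + noise
    return X, y, source, scale

X, y, source, scale = sample_wide_task()
print(X.shape, y.shape)
```

In a continued pre-training setup, many such tasks would be drawn with varying sizes and noise levels and fed to the model as training episodes; the specifics of that loop are not given on this page and are left out here.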
Christopher Kolberg
Department of Computer Science, University of Tübingen
Katharina Eggensperger
Professor for ML and AI | Lamarr Institute, TU Dortmund University
AutoML, Hyperparameter Optimization, Bayesian Optimization, Meta-Learning, Tabular Data
Nico Pfeifer
Department of Computer Science, University of Tübingen