Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained foundation models often suffer from inaccurate embeddings, limiting downstream performance; conventional fine-tuning improves task-specific accuracy but degrades out-of-distribution generalisation. Method: We propose *fill-tuning*, a novel paradigm that identifies deficient regions of the embedding via topological roughness analysis of the latent space and generates a minimal set of targeted fill-in data (only 100 samples) to drive lightweight continued pretraining, without requiring task labels. Contribution/Results: Fill-tuning offers a global, non-destructive optimisation of embedding quality, preserving pretrained knowledge while improving representational fidelity. Evaluated on state-of-the-art materials foundation models trained on $O(10^9)$ data points, it incurs computational overhead comparable to standard fine-tuning yet yields improvements of almost 1% across all downstream tasks, breaking the longstanding trade-off between fine-tuning and generalisation degradation.

📝 Abstract
Pretrained foundation models learn embeddings that can be used for a wide range of downstream tasks. These embeddings optimise general performance, and if insufficiently accurate at a specific task the model can be fine-tuned to improve performance. For all current methodologies this operation necessarily degrades performance on all out-of-distribution tasks. In this work we present 'fill-tuning', a novel methodology to generate datasets for continued pretraining of foundation models that are not suited to a particular downstream task, but instead aim to correct poor regions of the embedding. We present the application of roughness analysis to latent space topologies and illustrate how it can be used to propose data that will be most valuable to improving the embedding. We apply fill-tuning to a set of state-of-the-art materials foundation models trained on $O(10^9)$ data points and show model improvement of almost 1% in all downstream tasks with the addition of only 100 data points. This method provides a route to the general improvement of foundation models at the computational cost of fine-tuning.
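The roughness-guided data proposal described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `local_roughness` is a hypothetical proxy that scores each point by the variance of embedding-space distances to its input-space neighbours, and `propose_fill_data` samples new candidates near the roughest region. All function names, the roughness metric, and the Gaussian sampling scheme are assumptions for illustration only.

```python
import numpy as np

def local_roughness(X, Z, k=10):
    """Estimate latent-space roughness at each input point.

    Hypothetical proxy: roughness at a point is the variance of
    embedding-space distances to its k nearest input-space neighbours.
    X: (n, d_in) inputs; Z: (n, d_emb) embeddings from a frozen model.
    """
    n = len(X)
    rough = np.empty(n)
    for i in range(n):
        d_in = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d_in)[1:k + 1]      # k nearest, excluding the point itself
        d_emb = np.linalg.norm(Z[nbrs] - Z[i], axis=1)
        rough[i] = d_emb.var()                # high variance = rough embedding locally
    return rough

def propose_fill_data(X, Z, n_new=100, k=10, noise=0.05, seed=0):
    """Propose n_new candidate inputs near the roughest embedding region."""
    rng = np.random.default_rng(seed)
    rough = local_roughness(X, Z, k)
    centre = X[np.argmax(rough)]              # anchor at the roughest point
    return centre + noise * rng.standard_normal((n_new, X.shape[1]))
```

The proposed points would then be labelled (e.g. by simulation) and used for continued pretraining of the foundation model, rather than for any single downstream task.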
Problem

Research questions and friction points this paper is trying to address.

Improve materials foundation models' embeddings.
Address performance degradation in out-of-distribution tasks.
Enhance general model performance with minimal data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fill-tuning corrects poor regions of the embedding without task-specific data
Roughness analysis of the latent space topology identifies where new data is most valuable
Only 100 added data points improve all downstream tasks by almost 1%
Matthew P. Wilson
IBM Research Europe, Hartree Centre, Daresbury, United Kingdom
Edward O. Pyzer-Knapp
Chief Scientific Officer | Editor in Chief, Applied AI Letters
Bayesian Optimization · Quantum AI · Computational Chemistry · Chemical Information · Chemical AI
Nicolas Galichet
IBM Research Europe, Hartree Centre, Daresbury, United Kingdom
Luke Dicks
Xyme, Manchester, United Kingdom