Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs

📅 2024-07-04
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Fine-tuning large language models (LLMs) for tabular data often exhibits "fine-tuning multiplicity": distinct yet equally well-performing models yield conflicting predictions on identical inputs, undermining reliability in high-stakes domains such as finance and healthcare. Method: we formally define the single-sample prediction consistency problem under fine-tuning multiplicity and propose a computationally tractable stability metric based on local neighborhood sampling in the embedding space. Leveraging Bernstein's inequality, we derive probabilistic robustness guarantees without requiring retraining. Results: evaluated across multiple real-world tabular datasets, the metric correlates strongly with empirical ensemble prediction consistency, and predictions certified as highly robust remain stable with >95% confidence, supporting trustworthy deployment.
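For reference, Bernstein's inequality for i.i.d. bounded random variables $X_i \in [0,1]$ with variance $\sigma^2$ takes the standard form (the paper's exact variant and constants may differ):

$$
\Pr\!\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1]\right| \ge t\right)
\le 2\exp\!\left(-\frac{n t^2}{2\sigma^2 + \tfrac{2}{3}\, t}\right).
$$

Read $X_i$ as the indicator that the $i$-th neighborhood sample agrees with the model's prediction on the input: a high empirical agreement rate with small variance then yields a high-probability consistency certificate without any retraining.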

📝 Abstract
Fine-tuning large language models (LLMs) on limited tabular data for classification tasks can lead to *fine-tuning multiplicity*, where equally well-performing models make conflicting predictions on the same inputs due to variations in the training process (e.g., seed, random weight initialization, retraining on additional or deleted samples). This raises critical concerns about the robustness and reliability of Tabular LLMs, particularly when deployed for high-stakes decision-making in domains such as finance, hiring, education, and healthcare. This work formalizes the challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining. Our metric quantifies a prediction's stability by sampling the model's local behavior around the input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic robustness guarantees against a broad class of fine-tuned models. Leveraging Bernstein's inequality, we show that predictions with sufficiently high robustness (as defined by our measure) remain consistent with high probability. We also provide empirical evaluation on real-world datasets to support our theoretical results. Our work highlights the importance of addressing fine-tuning instabilities to enable the trustworthy deployment of LLMs in high-stakes and safety-critical applications.
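The neighborhood-sampling idea can be sketched as follows. This is a minimal illustration, assuming a Gaussian perturbation kernel in embedding space and a Maurer–Pontil-style empirical Bernstein lower bound; the function names, constants, and sampling scheme are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def stability_score(predict_proba, embedding, n_samples=500, sigma=0.1, seed=0):
    """Estimate a prediction's stability by sampling the model's local
    neighborhood around `embedding` (hypothetical Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    base_label = int(np.argmax(predict_proba(embedding)))
    # Draw Gaussian perturbations around the input embedding.
    noise = rng.normal(scale=sigma, size=(n_samples, embedding.shape[0]))
    neighbors = embedding[None, :] + noise
    labels = np.array([int(np.argmax(predict_proba(z))) for z in neighbors])
    # Fraction of neighborhood predictions agreeing with the base label.
    return float(np.mean(labels == base_label))

def bernstein_lower_bound(p_hat, n, delta=0.05):
    """Empirical-Bernstein-style lower bound on the true agreement
    probability, holding with probability >= 1 - delta (Maurer-Pontil
    form; the paper's exact constants may differ)."""
    var = p_hat * (1.0 - p_hat)  # Bernoulli variance estimate
    slack = (np.sqrt(2.0 * var * np.log(2.0 / delta) / n)
             + 7.0 * np.log(2.0 / delta) / (3.0 * (n - 1)))
    return max(0.0, p_hat - slack)
```

A prediction would then be certified as robust when `bernstein_lower_bound(stability_score(...), n_samples)` exceeds a chosen consistency threshold, with no fine-tuned model ensemble required at certification time.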
Problem

Research questions and friction points this paper is trying to address.

Quantifying prediction consistency in fine-tuned tabular LLMs
Addressing reliability concerns from conflicting model predictions
Measuring local stability to guarantee prediction consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies prediction consistency without retraining
Analyzes local model behavior in embedding space
Provides probabilistic guarantees on prediction stability