Do Generalisation Results Generalise?

📅 2025-12-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
It remains unclear whether a large language model's (LLM's) out-of-distribution (OOD) generalisation performance on a single OOD dataset reliably reflects its robustness under diverse distribution shifts. Method: the authors evaluate models on multiple OOD test sets throughout a finetuning run, then use partial correlation analysis to quantify inter-test-set generalisation correlations while controlling for in-domain performance as a confound. Results: experiments on OLMo2 and OPT models reveal no consistent positive or negative cross-test-set correlation; instead, correlation patterns depend strongly on the specific model and test-set pairing. This suggests that single-OOD benchmarks are inadequate for assessing robust generalisation, and motivates multi-OOD evaluation protocols for more reliable, reproducible robustness estimation.

📝 Abstract
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
Problem

Research questions and friction points this paper is trying to address.

Assesses LLM out-of-distribution generalization across multiple datasets
Investigates if generalization results are consistent when controlling for in-domain performance
Finds generalization correlations vary by model without an overarching trend
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates multiple out-of-distribution datasets for generalization
Uses partial correlation to control for in-domain performance
Analyzes trends across different models like OLMo2 and OPT
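The core statistic here is a partial correlation: each OOD test set's performance series is residualised against in-domain performance, and the residuals are then correlated. Below is a minimal sketch of that computation using only NumPy; the function name, the checkpoint accuracies, and the synthetic data are illustrative assumptions, not the paper's actual code or numbers.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Pearson correlation of x and y after regressing out z
    (ordinary least squares with an intercept), i.e. the partial
    correlation of x and y controlling for z."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])  # design matrix [1, z]
    # residuals of x and y after an OLS fit on z
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical per-checkpoint accuracies over a finetuning run.
rng = np.random.default_rng(0)
in_domain = np.linspace(0.5, 0.9, 20)            # in-domain accuracy
ood_a = in_domain + rng.normal(0, 0.02, 20)      # OOD test set A
ood_b = in_domain + rng.normal(0, 0.02, 20)      # OOD test set B
print(partial_correlation(ood_a, ood_b, in_domain))
```

Note that the raw correlation between `ood_a` and `ood_b` is high simply because both track in-domain accuracy; the partial correlation removes that shared trend, which is exactly why the paper regresses out in-domain performance before comparing OOD test sets.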