NeurIPS 2023 LLM Efficiency Fine-tuning Competition

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies pervasive benchmark overfitting in the NeurIPS 2023 LLM Efficiency Fine-tuning Competition: top-performing methods suffer a 32% performance drop on closed, unseen test sets, revealing how fragile generalization is under public-benchmark-driven evaluation. The competition's two-stage design (an open stage on public tasks followed by a closed stage on unseen tasks) makes this overfitting measurable and reproducible. Analysis of the submissions shows that state-of-the-art results stemmed primarily from careful data curation rather than model-architecture or algorithmic innovation, underscoring data quality as a key lever for LLM generalization. All competition submissions, standardized fine-tuning pipelines, and Dockerized evaluation infrastructure are open-sourced, providing both a methodological foundation and an empirical resource for rethinking evaluation of generative models.
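The reported 32% figure is a relative score drop between the two evaluation stages. A minimal sketch of that computation, with illustrative scores (not the competition's actual numbers):

```python
# Hypothetical sketch: the generalization gap measured by the two-stage
# evaluation is the relative score drop from the open (public benchmark)
# stage to the closed (unseen tasks) stage.

def relative_drop(open_score: float, closed_score: float) -> float:
    """Relative performance drop from the open stage to the closed stage."""
    return (open_score - closed_score) / open_score

# A model scoring 0.75 on public tasks but 0.51 on unseen tasks shows a
# 32% relative drop, the kind of gap the summary reports.
print(f"{relative_drop(0.75, 0.51):.0%}")  # → 32%
```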

📝 Abstract
Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed two trends: top-performing models exhibit significant overfitting on benchmark datasets, mirroring the broader issue of benchmark overfitting on popular leaderboards, and data curation is essential for building a high-performing LLM. The competition, which consisted of two stages (an open evaluation stage with publicly available tasks and a closed evaluation stage with unseen tasks), allowed us to assess the generalizability of fine-tuned LLMs. Our results highlight the limitations of current benchmark-based evaluation schemes for generative models and demonstrate the need for more robust evaluation methods. Notably, the winning submissions utilized standard open-source libraries and focused primarily on data curation. To facilitate further research and promote reproducibility, we release all competition entries, Docker files, and evaluation infrastructure, providing a valuable resource for the community to explore fine-tuning, overfitting, and reproducibility in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses overfitting in large language models on benchmark datasets.
Highlights limitations of current benchmark-based evaluation schemes.
Emphasizes the need for robust evaluation methods and data curation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilized standard open-source libraries for efficiency.
Focused on data curation to enhance performance.
Released competition resources for reproducibility and research.
Mark-Albert Saroufim
Yotam Perlitz
IBM Research AI
Natural Language Generation, Domain Adaptation, Semantics Evaluation
Leshem Choshen
MIT, IBM AI Research
Model Recycling, Evolving Collaborative Pretraining, Evaluation, Model Merging, Open the Black Box
L. Antiga
Greg Bowyer
Christian Puhrsch
Driss Guessous
Supriya Rao
Geeta Chauhan
Ashvini Kumar
Jindal Pawan Kumar
Rajpoot Ankur Parikh
Joe Isaacson
Weiwei Yang
Microsoft Research, Caltech
Machine Learning, Biologically Inspired Models, Neural Networks