The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the propensity of large language models (LLMs) to generate repetitive, monotonous, and incoherent long-range text under greedy decoding in open-ended generation. Contrary to conventional wisdom, the authors propose a counterintuitive solution: near-zero-loss fine-tuning of pre-trained LLMs on extremely small datasets (e.g., dozens of samples), a process they term hyperfitting. They formally define and empirically demonstrate this phenomenon in open-ended generation, show that it is distinct from grokking and double descent, and find that the resulting low-entropy predictions coincide with markedly higher sequence diversity and human preference scores. Through small-scale supervised fine-tuning, entropy analysis of model predictions, and human evaluation, they show that greedy decoding with hyperfitted models outperforms standard top-p sampling in coherence, diversity, and preference ratings. These gains generalize across LLM scales, diverse text domains, and even autoregressive image generation.

📝 Abstract
This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms top-p sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
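The abstract's key observation, that hyperfitted models place nearly all next-token probability on a single token, can be made concrete with toy distributions. The sketch below is illustrative only (the distributions are invented, not taken from the paper) and shows how Shannon entropy collapses for a peaked distribution, and how greedy decoding compares with the top-p (nucleus) sampling it is evaluated against:

```python
import math
import random

def entropy(p):
    """Shannon entropy (bits) of a next-token distribution {token: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def greedy(p):
    """Greedy decoding: always pick the single most probable token."""
    return max(p, key=p.get)

def top_p(p, threshold=0.9, rng=None):
    """Top-p sampling: draw from the smallest set of highest-probability
    tokens whose cumulative mass reaches the threshold."""
    rng = rng or random.Random(0)
    nucleus, total = [], 0.0
    for tok, prob in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        nucleus.append((tok, prob))
        total += prob
        if total >= threshold:
            break
    toks, weights = zip(*nucleus)
    return rng.choices(toks, weights=weights, k=1)[0]

# A typical pre-trained model spreads probability mass across tokens...
base = {"the": 0.30, "a": 0.25, "of": 0.20, "cat": 0.15, "dog": 0.10}
# ...whereas a hyperfitted model concentrates it on one token (toy values).
hyper = {"the": 0.97, "a": 0.01, "of": 0.01, "cat": 0.005, "dog": 0.005}

print(round(entropy(base), 3))   # ≈ 2.228 bits
print(round(entropy(hyper), 3))  # ≈ 0.252 bits
print(greedy(hyper))             # 'the'
```

Note that for the peaked distribution, the top-p nucleus already collapses to a single token, so greedy decoding and top-p sampling coincide; this is one way to read the abstract's claim that greedy decoding suffices once predictions are low-entropy.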
Problem

Research questions and friction points this paper is trying to address.

Repetitive, dull text under greedy decoding
Enhancing diversity in open-ended text generation
Stabilizing LLM output over long sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperfitting: near-zero-loss fine-tuning enhances LLM generation
Fine-tuning on extremely small datasets (dozens of samples)
Low-entropy predictions coincide with improved generation