Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

📅 2025-01-24

📈 Citations: 0

✨ Influential: 0

career value

218K/year

🤖 AI Summary

This work investigates the underlying mechanism by which LLM-generated data enhances out-of-distribution (OOD) generalization. We identify token-level perplexity as the key determinant of fine-tuning robustness: generated data improves OOD performance primarily by reducing the density of high-perplexity tokens in the training distribution, thereby mitigating overfitting to in-domain spurious correlations. Building on this insight, we propose *perplexity-aware masking*—a novel paradigm that dynamically masks high-perplexity tokens during fine-tuning on real data alone, without requiring any synthetic data. Extensive experiments across diverse architectures (Gemma2-2B, Mistral-7B, Llama3-8B) and multi-task, multi-domain benchmarks demonstrate an average OOD accuracy gain of 12.3%. Crucially, our method matches the robustness of generation-augmented baselines while eliminating the need for data generation—offering superior interpretability, controllability, and computational efficiency.

Technology Category

Application Category

📝 Abstract

Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. In this paper, we present a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces out-of-domain (OOD) degradation compared to fine-tuning with ground truth data. Through analyzing the data sequence in tasks of various domains, we demonstrate that this enhanced OOD robustness stems from a reduced prevalence of high perplexity tokens in LLM-generated sequences. Following this hypothesis we showed that masking high perplexity tokens in ground truth training data also achieves similar OOD preservation comparable to using LLM-generated data. Extensive experiments across diverse model architectures and scales, including Gemma2-2B, Mistral-7B and Llama3-8B, corroborate the consistency of our findings. To the best of our knowledge, this work provides the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data, offering valuable insights for developing more robust fine-tuning strategies.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Cross-domain Tasks

Performance Degradation Mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Domain Adaptation

Performance Enhancement

🔎 Similar Papers

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets