60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering

📅 2024-09-24

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This study investigates whether, and why, supervised fine-tuning (SFT) of large language models (LLMs) works with extremely little data. Focusing on question-answering tasks, it categorizes SFT data by how well the pretrained model has already memorized the underlying knowledge, then examines how data scale, data category, and model family (Llama, Qwen, Phi) jointly affect SFT results. The key finding is that as few as 60 samples are enough to activate pretrained knowledge and approach the performance of fine-tuning on the full dataset. The study also reports that models differ in their sensitivity to memorization-level categories: low-memorization data improves generalization, whereas high-memorization data induces overfitting, which indicates that the best data composition depends on the model. These results offer empirical evidence and practical guidelines for data-efficient fine-tuning of LLMs.

📝 Abstract

Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.
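The categorization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `recall_rate` field (the fraction of sampled pretrained-model completions that contain the gold answer), the equal-width bucketing, the default of five levels, and the helper names `memorization_level` and `sample_sft_subset` are all assumptions introduced here for illustration. Only the subset size of 60 comes from the text above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    question: str
    answer: str
    # Hypothetical field: fraction of sampled completions from the
    # pretrained model that contained the gold answer, in [0, 1].
    # Assumed to be measured beforehand; not defined by the source.
    recall_rate: float


def memorization_level(recall_rate: float, n_levels: int = 5) -> int:
    """Map a recall rate to a bucket 0..n_levels-1.

    Equal-width bins are an assumption, not the paper's stated scheme.
    """
    if not 0.0 <= recall_rate <= 1.0:
        raise ValueError("recall_rate must be in [0, 1]")
    # A recall rate of exactly 1.0 falls into the top bucket.
    return min(int(recall_rate * n_levels), n_levels - 1)


def sample_sft_subset(examples, level, k=60, n_levels=5, seed=0):
    """Draw up to k examples whose memorization level equals `level`."""
    pool = [
        ex for ex in examples
        if memorization_level(ex.recall_rate, n_levels) == level
    ]
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return rng.sample(pool, min(k, len(pool)))


# Toy data with synthetic recall rates spread evenly over [0, 1).
toy_examples = [
    QAExample(f"question {i}", f"answer {i}", i / 500) for i in range(500)
]
low_memorization_subset = sample_sft_subset(toy_examples, level=0)
```

Drawing one such subset per level, fine-tuning on each, and comparing QA accuracy would mirror the comparison the abstract describes, under the assumptions listed above.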

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Limited Data Learning

Model Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Learning Efficiency

Data Demand
