Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how fine-tuning enhances few-shot prompting performance in Automated Short Answer Grading (ASAG), particularly under the real-world scarcity of labeled data. Methodologically, we compare closed-source OpenAI models against open-weight models, including Llama 3.1 8B-Instruct, employing QLoRA for efficient fine-tuning on consumer-grade GPUs and integrating structured JSON few-shot prompts. A key innovation is seeding the fine-tuning data with low-cost synthetic examples. Results show that fine-tuned OpenAI models substantially outperform few-shot prompting baselines; fine-tuning Llama on the original data alone yields limited gains, but incorporating synthetic data leads to significant performance improvements, modulated markedly by subject domain. This work is the first to systematically characterize the interplay among synthetic data quality, model openness (i.e., weight accessibility), and domain-specific characteristics in determining ASAG fine-tuning efficacy.
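The structured JSON few-shot prompting described above can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the grading instruction, the 0-2 score scale, and the example answers are all hypothetical placeholders.

```python
import json

def build_asag_prompt(question, reference_answer, examples, student_answer):
    """Assemble a few-shot ASAG prompt that asks the model to grade a
    short answer and reply with a structured JSON object."""
    lines = [
        "You are grading short answers. Reply ONLY with JSON of the form "
        '{"score": <0-2>, "rationale": "<one sentence>"}.',
        f"Question: {question}",
        f"Reference answer: {reference_answer}",
    ]
    # Few-shot block: each labeled example shows the expected JSON output.
    for ex in examples:
        lines.append(f"Student answer: {ex['answer']}")
        lines.append("Grade: " + json.dumps(
            {"score": ex["score"], "rationale": ex["rationale"]}))
    # Final, ungraded answer for the model to score.
    lines.append(f"Student answer: {student_answer}")
    lines.append("Grade:")
    return "\n".join(lines)

prompt = build_asag_prompt(
    question="Why does ice float on water?",
    reference_answer="Ice is less dense than liquid water.",
    examples=[
        {"answer": "Because ice is lighter than water for the same volume.",
         "score": 2, "rationale": "Correctly identifies lower density."},
        {"answer": "Because it is cold.", "score": 0,
         "rationale": "Temperature alone does not explain buoyancy."},
    ],
    student_answer="Frozen water has lower density, so it floats.",
)
```

Demanding JSON output makes the grade machine-parseable, and the in-context examples double as a format specification the model can imitate.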

📝 Abstract
Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and zero- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while methods using open weights such as Quantized Low-Rank Adaptation (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning can outperform few-shot prompting baselines with instruction-tuned LLMs for OpenAI's closed models. While our evaluation set is limited, we find some evidence that the observed benefits of fine-tuning may depend on the subject-matter domain. Lastly, we observed dramatic improvement with the Llama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.
Problem

Research questions and friction points this paper is trying to address.

Evaluating fine-tuning vs few-shot prompting for short answer grading
Comparing closed and open-weight models' performance in ASAG
Assessing impact of synthetic data on open-weight model fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs with minimal data examples
Using QLoRA for consumer-GPU fine-tuning
Enhancing models with synthetic training data
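The synthetic-data seeding idea can be sketched as below, assuming a chat-style JSONL training format like the one OpenAI's fine-tuning service consumes. The record fields, the system message, and the 1:3 human-to-synthetic mix are illustrative assumptions, not the paper's actual recipe.

```python
import json
import random

def seed_training_set(human_examples, synthetic_examples, seed=0):
    """Combine a small human-labeled set with a larger pool of cheaply
    generated synthetic examples, shuffled into one fine-tuning dataset."""
    combined = list(human_examples) + list(synthetic_examples)
    random.Random(seed).shuffle(combined)
    return combined

def to_chat_jsonl(records):
    """Render grading records as chat-format JSONL (one JSON object per
    line), the shape used for conversational fine-tuning data."""
    lines = []
    for r in records:
        lines.append(json.dumps({"messages": [
            {"role": "system",
             "content": "Grade the short answer; reply with JSON."},
            {"role": "user",
             "content": f"Q: {r['question']}\nA: {r['answer']}"},
            {"role": "assistant",
             "content": json.dumps({"score": r["score"]})},
        ]}))
    return "\n".join(lines)

# Hypothetical data: one human-labeled record plus three synthetic ones.
human = [{"question": "Why does ice float?",
          "answer": "Lower density.", "score": 2}]
synthetic = [{"question": "Why does ice float?",
              "answer": "It is cold.", "score": 0} for _ in range(3)]
dataset = seed_training_set(human, synthetic)
print(len(dataset))  # prints 4 (1 human + 3 synthetic)
```

Shuffling the combined pool keeps the scarce human labels from clustering at one end of the training run; the assistant turn stores the target grade as the JSON the model should learn to emit.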
Joel Walsh
University of Southern California - Institute for Creative Technologies, Los Angeles, CA USA
Siddarth Mamidanna
University of California, Santa Cruz, Santa Cruz, CA, USA
Benjamin Nye
University of Southern California, Institute for Creative Technologies
Mark Core
University of Southern California - Institute for Creative Technologies, Los Angeles, CA USA
Daniel Auerbach
University of Southern California - Institute for Creative Technologies, Los Angeles, CA USA